
charm - Re: [charm] Deadlock detection

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Phil Miller <mille121 AT illinois.edu>
  • To: Jozsef Bakosi <jbakosi AT lanl.gov>
  • Cc: Vinicius Freitas <vinicius.mct.freitas AT gmail.com>, charm <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] Deadlock detection
  • Date: Wed, 27 Jun 2018 11:32:16 -0500

Hi Jozsef,

You could modify your objects' existing control flow to set a flag or counter indicating where they are, and add a 'debugging output' method to them that prints those counters/flags. The QD callback would then call that method.
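A minimal sketch of that idea in plain C++, with no Charm++ dependency so it can be compiled on its own. Everything here is hypothetical and for illustration only: the `Worker` class, the `Phase` values, and the `dumpState`/`onQuiescence` names are not taken from Jozsef's code or from the Charm++ API. In a real application, `Worker` would be a chare, `dumpState` would be one of its entry methods, and `onQuiescence` would be the target of the quiescence detection callback.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical phases of one object's control flow (illustrative names).
enum class Phase { Init, WaitingForGhosts, Computing, Done };

const char* phaseName(Phase p) {
  switch (p) {
    case Phase::Init:             return "Init";
    case Phase::WaitingForGhosts: return "WaitingForGhosts";
    case Phase::Computing:        return "Computing";
    case Phase::Done:             return "Done";
  }
  return "Unknown";
}

// Stand-in for a chare: it records where it is in its control flow.
class Worker {
 public:
  explicit Worker(int id) : id_(id) {}

  // Existing control flow, extended to update the flag and counters.
  void start(int ghostsExpected) {
    ghostsExpected_ = ghostsExpected;
    phase_ = Phase::WaitingForGhosts;
  }
  void recvGhost() {
    ++ghostsReceived_;
    if (ghostsReceived_ == ghostsExpected_) phase_ = Phase::Computing;
  }

  Phase phase() const { return phase_; }

  // The 'debugging output' method: reports the flag and counters.
  std::string dumpState() const {
    std::ostringstream os;
    os << "worker " << id_ << ": phase=" << phaseName(phase_)
       << " ghosts=" << ghostsReceived_ << "/" << ghostsExpected_;
    return os.str();
  }

 private:
  int id_;
  Phase phase_ = Phase::Init;
  int ghostsExpected_ = 0;
  int ghostsReceived_ = 0;
};

// What the quiescence callback would do: ask every object to report.
std::string onQuiescence(const std::vector<Worker>& workers) {
  std::string report;
  for (const auto& w : workers) report += w.dumpState() + "\n";
  return report;
}
```

An object that still reports a waiting phase with fewer messages received than expected is a likely place to start looking for the missing send.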

For the development team, this suggests a simple addition to SDAG: an entry method that dumps the object's SDAG state, so that users would only have to write their own if they need something more specific. It also suggests a useful feature for CharmDebug: being able to examine this state interactively. Drawing from Allinea's tools, it would be good to visualize and cluster objects' SDAG state, making it easy to examine and compare representatives and outliers.
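As a rough illustration of the clustering idea, here is a small plain C++ sketch that groups per-object state reports and picks out the rarely occupied states. It is not an existing SDAG or CharmDebug feature; `StateReport`, `groupByState`, and `outlierStates` are hypothetical names, and the state labels are assumed to come from whatever dump method the objects provide.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One record per object: its id and a short label for where it is stuck.
struct StateReport {
  int objectId;
  std::string state;
};

// Group object ids by the state they reported.
std::map<std::string, std::vector<int>> groupByState(
    const std::vector<StateReport>& reports) {
  std::map<std::string, std::vector<int>> groups;
  for (const auto& r : reports) groups[r.state].push_back(r.objectId);
  return groups;
}

// Return the states held by at most maxCount objects. These are the
// outliers, usually worth inspecting before the large groups.
std::vector<std::string> outlierStates(
    const std::map<std::string, std::vector<int>>& groups,
    std::size_t maxCount) {
  std::vector<std::string> outliers;
  for (const auto& g : groups)
    if (g.second.size() <= maxCount) outliers.push_back(g.first);
  return outliers;
}
```

With thousands of objects, comparing one representative from the largest group against the few outliers is much quicker than reading every object's dump.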

Phil

On Wed, Jun 27, 2018 at 10:58 AM, Jozsef Bakosi <jbakosi AT lanl.gov> wrote:
Hi Vinicius,

This is useful. I have put in a quiescence detection callback, which is not
triggered most of the time. But I managed to trigger it by running a bunch of
regression tests, scheduled in parallel on a shared-memory machine, in an
infinite shell while-loop, while also randomly loading all CPUs in the
background and using Charm++ randomized message queues.  I'm
assuming that triggering quiescence in this scenario means that I successfully
hit an error that would otherwise lead to a deadlock. (This might very well be
a conjecture with some wishful thinking.)

Am I correct that quiescence should not happen in a "normal" application that
is not specifically designed with quiescence in mind?

Also, for more information, I built Charm++ with the following build flags:

--enable-error-checking --with-prio-type=int --enable-randomized-msgq --suffix randq-debug

So assuming I'm on the right path, now I would like to get a more meaningful
state out of my application besides the trace below. Does anyone have any clue
how to get perhaps an entry method name into such a trace?

=======================================================================
Stack trace (most recent call last):
#23   Object "/lib/x86_64-linux-gnu/ld-2.27.so", at 0xffffffffffffffff, in
#22   Object ".../Main/inciter", at 0x4c13a9, in _start
#21   Source "../csu/libc-start.c", line 310, in __libc_start_main
#20   Source ".../clang-x86_64/charm/mpi-linux-x86_64-randq-debug/tmp/main.C", line 9, in main
          6:   int stack_top=0;
          7:   memory_stack_top = &stack_top;
          8:
      >   9:   ConverseInit(argc, argv, (CmiStartFn) _initCharm, 0, 0);
         10:
         11:   return 0;
         12: }
#19   Source "./machine-common-core.c", line 1432, in ConverseInit
#18   Source "./machine-common-core.c", line 1539, in ConverseRunPE
#17   Source ".../clang-x86_64/charm/mpi-linux-x86_64-randq-debug/tmp/convcore.c", line 1847, in CsdScheduler
       1845: int CsdScheduler(int maxmsgs)
       1846: {
      >1847:    if (maxmsgs<0) CsdScheduleForever();
       1848:    else if (maxmsgs==0)
       1849:            CsdSchedulePoll();
       1850:    else /*(maxmsgs>0)*/
#16   Source ".../clang-x86_64/charm/mpi-linux-x86_64-randq-debug/tmp/convcore.c", line 1911, in CsdScheduleForever
       1908:     msg = CsdNextMessage(&state);
       1909:     if (msg!=NULL) { /*A message is available-- process it*/
       1910:       if (isIdle) {isIdle=0;CsdEndIdle();}
      >1911:       SCHEDULE_MESSAGE
       1912:
       1913:       #if CMK_CELL
       1914:         if (progressCount <= 0) {
#15   Source ".../clang-x86_64/charm/mpi-linux-x86_64-randq-debug/tmp/convcore.c", line 1663, in CmiHandleMessage
       1660:           CmiAbort("Msg handler does not exist, possible race condition during init\n");
       1661:         }
       1662: #endif
      >1663:    (h->hdlr)(msg,h->userPtr);
       1664: #if CMK_TRACE_ENABLED
       1665:    /* setMemoryStatus(0) */ /* charmdebug */
       1666:    //_LOG_E_HANDLER_END(handlerIdx);       /* projector */
#14   Source ".../clang-x86_64/charm/mpi-linux-x86_64-randq-debug/tmp/ck.C", line 1329, in _processHandler
       1326:     case ForChareMsg :
       1327:       TELLMSGTYPE(CkPrintf("proc[%d]: _processHandler with msg type: ForChareMsg\n", CkMyPe());)
       1328:       if(env->isPacked()) CkUnpackMessage(&env);
      >1329:       _processForPlainChareMsg(ck,env);
       1330:       break;
       1331:     case ForVidMsg   :
       1332:       TELLMSGTYPE(CkPrintf("proc[%d]: _processHandler with msg type: ForVidMsg\n", CkMyPe());)
#13   Source ".../clang-x86_64/charm/mpi-linux-x86_64-randq-debug/tmp/ck.C", line 1025, in _processForPlainChareMsg
       1022:     obj = env->getObjPtr();
       1023: #endif
       1024:   }
      >1025:   _invokeEntry(epIdx,env,obj);
       1026:   _STATS_RECORD_PROCESS_MSG_1();
       1027: }
#12   Source ".../clang-x86_64/charm/mpi-linux-x86_64-randq-debug/tmp/ck.C", line 652, in _invokeEntry
        649:     _TRACE_BEGIN_EXECUTE(env, obj);
        650:     if(_entryTable[epIdx]->appWork)
        651:         _TRACE_BEGIN_APPWORK();
      > 652:     _invokeEntryNoTrace(epIdx,env,obj);
        653:     if(_entryTable[epIdx]->appWork)
        654:         _TRACE_END_APPWORK();
        655:     _TRACE_END_EXECUTE();
#11   Source ".../clang-x86_64/charm/mpi-linux-x86_64-randq-debug/tmp/ck.C", line 641, in _invokeEntryNoTrace
        638: {
        639:   void *msg = EnvToUsr(env);
        640:   _SET_USED(env, 0);
      > 641:   CkDeliverMessageFree(epIdx,msg,obj);
        642: }
        643:
        644: static inline void _invokeEntry(int epIdx,envelope *env,void *obj)
#10   Source ".../clang-x86_64/charm/mpi-linux-x86_64-randq-debug/tmp/ck.C", line 597, in CkDeliverMessageFree
        594: #if CMK_CHARMDEBUG
        595:   CpdBeforeEp(epIdx, obj, msg);
        596: #endif
      > 597:   _entryTable[epIdx]->call(msg, obj);
        598: #if CMK_CHARMDEBUG
        599:   CpdAfterEp(epIdx);
        600: #endif
#9    Source "RNGTest/../Main/inciter.def.h", line 290, in _call_quiescence_void
#8    Source "../Main/Inciter.C", line 216, in quiescence
        213:     }
        214:
        215:     void quiescence() {
      > 216:       Throw( "Quiescence detected" );
        217:     }
        218:                                               
        219:   private:                                   
#7    Object "/usr/lib/x86_64-linux-gnu/libc++abi.so.1.0", at 0x7fffe4c294ae, in __cxa_throw
#6    Object "/usr/lib/x86_64-linux-gnu/libc++abi.so.1.0", at 0x7fffe4c29515, in
#5    Object "/usr/lib/x86_64-linux-gnu/libc++abi.so.1.0", at 0x7fffe4c26762, in
#4    Source "../Base/ProcessException.C", line 79, in __invoke
         76: // *****************************************************************************
         77: {                                             
         78:   // override Charm++'s terminate handler     
      >  79:   std::set_terminate( [](){                   
         80:     tk::Exception("Terminate was called").handleException();
         81:     CkAbort( "Signal caught" );               
         82:   } );                                       
#3    Source "../Base/ProcessException.C", line 80, in operator()
         77: {                                             
         78:   // override Charm++'s terminate handler     
         79:   std::set_terminate( [](){                   
      >  80:     tk::Exception("Terminate was called").handleException();
         81:     CkAbort( "Signal caught" );               
         82:   } );                                       
#2    Source "../Base/Exception.C", line 193, in handleException
        190:   #ifdef HAS_BACKWARD                         
        191:   fprintf( stderr, ">>>\n>>> =========== STACK TRACE ==========\n>>>\n" );
        192:   using namespace backward;                   
      > 193:   StackTrace st; st.load_here(64);           
        194:   Printer p; p.print( st, stderr );           
        195:   fprintf( stderr, ">>>\n>>> ======= END OF STACK TRACE =======\n>>>\n" );
        196:   #endif                                     
#1    Source ".../clang-x86_64/include/backward.hpp", line 750, in load_here
        747:                    return 0;                 
        748:            }                                 
        749:            _stacktrace.resize(depth);         
      > 750:            size_t trace_cnt = details::unwind(callback(*this), depth);
        751:            _stacktrace.resize(trace_cnt);     
        752:            skip_n_firsts(0);                 
        753:            return size();                     
#0    Source ".../clang-x86_64/include/backward.hpp", line 734, in unwind<backward::StackTraceImpl<backward::system_tag::linux_tag>::callback>
        731: template <typename F>                         
        732: size_t unwind(F f, size_t depth) {           
        733:    Unwinder<F> unwinder;                     
      > 734:    return unwinder(f, depth);                 
        735: }                                             
        736:                                               
        737: } // namespace details                       
=======================================================================

Thanks,
Jozsef

On 06.26.2018 21:51, Vinicius Freitas wrote:
> Jozsef,
>
> That depends on what a deadlock is for you, in your specific application.
> There is a reason why Quiescence Detection is not usually recommended.
> Entry methods don't usually "wait to be called". If in your specific
> application you are specifically waiting for something to happen inside an
> entry method, then no, quiescence will not happen.
>
> So, for your first question: a deadlock may or may not be a quiescence,
> depending on what is causing your deadlock. Investigating the cause would be
> important for you if it is not viable to eliminate the deadlock completely.
> As for your second question: I've only used quiescence detection with
> callbacks in Charm++. That way, I only know where I am in the application's
> execution flow; information on what happened is only available from data
> that was manipulated beforehand, during the operations that caused the
> deadlock.
>
>
> Hope I was of help,
> Best regards,
>
> 2018-06-26 21:41 GMT-03:00 Jozsef Bakosi <jbakosi AT lanl.gov>:
>
> > Thanks, Vinicius,
> >
> > I also thought about quiescence detection, but the manual says:
> >
> > "In Charm++, quiescence is defined as the state in which no processor is
> > executing an entry point, no messages are awaiting processing, and there
> > are no messages in-flight."
> >
> > Does a deadlock qualify as quiescence? I think if there is a deadlock,
> > messages may be awaiting processing, i.e., entry methods are waiting to be
> > called, but for some reason they are not called. In other words, would
> > quiescence happen for a deadlock?
> >
> > Also, even if the answer is yes, what kind of information do I get at the
> > application level once my function is called after quiescence, and how do
> > I get it? How would that help debug what led to that function call?
> >
> > Thanks,
> > Jozsef
> >
> > On 06.26.2018 21:06, Vinicius Freitas wrote:
> > > Jozsef,
> > >
> > > Charm has a quiescence detection mechanism that might help you. When you
> > > start quiescence detection in Charm, you also declare the method it
> > > will reduce to (which every chare will then execute), and it will only
> > > be triggered once NOTHING is happening on the system. Would this help
> > > you in any way?
> > >
> > > Vinicius F.
> > >
> > > On Tue, Jun 26, 2018, 20:04, Jozsef Bakosi <jbakosi AT lanl.gov> wrote:
> > >
> > > > Hi folks,
> > > >
> > > > From time to time I run into asynchronous logic errors (that I'm pretty
> > > > sure are my fault) that non-deterministically produce deadlocks without
> > > > an error message.
> > > >
> > > > I wonder what tools I have available to diagnose such problems.
> > > >
> > > > It would be great if I could detect from the runtime system that some
> > > > messages are being waited on but nothing is really happening, in which
> > > > case I could dump some messages and their labels/ids/etc. That would
> > > > help identify at least which entry methods are waiting for messages, or
> > > > something similar.
> > > >
> > > > What do you suggest? How do you deal with such errors?
> > > >
> > > > Thanks,
> > > > Jozsef



