
charm - Re: [charm] Fwd: LBDB Migrate "Handle no longer registered"

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Vinicius Freitas <vinicius.mct.freitas AT gmail.com>
  • To: "Chandrasekar, Kavitha" <kchndrs2 AT illinois.edu>
  • Cc: Nitin Bhat <nitin.bhat.k AT gmail.com>, "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] Fwd: LBDB Migrate "Handle no longer registered"
  • Date: Wed, 20 Jun 2018 23:53:19 -0300

Hi, Nitin, Kavitha, and team,

Thank you for the feedback. I will try to prevent this kind of issue in the future. Next week I will remove the duplicates from the data structure, after confirming that those entries are what is being duplicated.
Of course, this should not happen at all if the strategy were working as it should. Thank you all again for the feedback.

Sincerely,

-- 
Vinicius Marino Calvo Torres de Freitas Computer Science Graduate Student (Aluno de pós-graduação em Ciência da Computação)
Research Assistant at the Embedded Computing Laboratory at UFSC
UFSC - CTC - INE - ECL, Brazil
Email: vinicius.mctf AT posgrad.ufsc.br or vinicius.mct.freitas AT gmail.com 
Tel: +55 (48) 996 163 803

2018-06-20 13:05 GMT-03:00 Chandrasekar, Kavitha <kchndrs2 AT illinois.edu>:
Hi Vinicius, 

Thanks for providing the details on the load balancer. Based on some debugging, it appears that, as you suggested, some chare objects are being added to the LBMigrateMsg multiple times. The error 'Handle no longer registered' could occur when Migrate is called on an object that has already been migrated and whose destructor has already been called. It might be helpful to check whether the load balancer is migrating some chare objects more than once.
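
One way to run the check suggested above is to filter the list of planned migrations before it is turned into a migration message. The following is only a minimal sketch in standalone C++: PlannedMove and dropDuplicateMoves are hypothetical names invented for illustration and are not Charm++ types or functions. It assumes each planned migration can be identified by an integer object handle.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for one planned migration; NOT a Charm++ type.
struct PlannedMove {
    int handle;   // identifies the object to migrate (assumed to be an integer)
    int from_pe;  // source processing element
    int to_pe;    // destination processing element
};

// Keep the first move recorded for each handle and drop any later repeats.
// Returns the number of entries that were removed.
std::size_t dropDuplicateMoves(std::vector<PlannedMove>& moves) {
    std::unordered_set<int> seen;
    std::vector<PlannedMove> unique;
    unique.reserve(moves.size());
    for (const PlannedMove& m : moves) {
        // insert() reports in .second whether the handle was new.
        if (seen.insert(m.handle).second) {
            unique.push_back(m);
        }
    }
    const std::size_t removed = moves.size() - unique.size();
    moves.swap(unique);
    return removed;
}
```

A nonzero return value means the strategy scheduled the same object more than once. Keeping the first entry is an arbitrary choice made for this sketch; which entry should win depends on the strategy.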

Thanks,
Kavitha

From: Nitin Bhat <nitin.bhat.k AT gmail.com>
Reply-To: Nitin Bhat <nitin.bhat.k AT gmail.com>
Date: Tuesday, June 19, 2018 at 4:15 PM
To: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
Subject: [charm] Fwd: LBDB Migrate "Handle no longer registered"

Forwarding Vinicius's load balancer to the mailing list for others to evaluate and debug. 

Thanks,
Nitin

---------- Forwarded message ---------
From: Vinicius Freitas <vinicius.mct.freitas AT gmail.com>
Date: Fri, Jun 15, 2018, 12:32 PM
Subject: Re: [charm] LBDB Migrate "Handle no longer registered"
To: Nitin Bhat <nitin.bhat.k AT gmail.com>


Nitin,

Thank you for your attention.
About your questions:
No, I am not calling either of those functions, and haven't used CkLocRec or _verifyAckRequestHandler.
I am not aware of any place where this could have happened. Maybe I am duplicating a migration? Could that cause this kind of problem?

Picture this case:
1. Task 0 is added to the migration message.
2. Task 1 is added to the migration message.
3. Task 0 is added again.

Later, when the migrations are processed, are they lost?
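
The three-step case above can be illustrated with a toy model. This is a sketch under loud assumptions: ToyRegistry and processAll are hypothetical and are not the Charm++ LBDB. The model assumes only that a handle stops being registered once its object has migrated away, which is the behavior discussed elsewhere in this thread.

```cpp
#include <cstddef>
#include <vector>

// Toy model only; NOT the Charm++ LBDB. It tracks which handles are registered.
class ToyRegistry {
public:
    enum class Result { Migrated, OutOfRange, NotRegistered };

    explicit ToyRegistry(std::size_t count) : registered_(count, true) {}

    // Migrating an object unregisters its handle on this side (assumption).
    Result migrate(std::size_t handle) {
        if (handle >= registered_.size()) return Result::OutOfRange;
        if (!registered_[handle]) return Result::NotRegistered;
        registered_[handle] = false;
        return Result::Migrated;
    }

private:
    std::vector<bool> registered_;
};

// Process the handles in the order they were added to the migration list.
std::vector<ToyRegistry::Result> processAll(ToyRegistry& registry,
                                            const std::vector<std::size_t>& handles) {
    std::vector<ToyRegistry::Result> results;
    results.reserve(handles.size());
    for (std::size_t h : handles) {
        results.push_back(registry.migrate(h));
    }
    return results;
}
```

With the list {0, 1, 0}, the first two entries migrate and the third yields NotRegistered: the handle is in range but no longer registered, the same shape as the reported message.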

I have been working in this git repository:
https://github.com/viniciusmctf/charm-affaware-dist-lb

To achieve this result, I have executed this load balancer with the following call:
./charmrun +p7 ./lb_test 15000 151 1 2 10 1000 mesh2d +balancer DistNeighborsLB +LBDebug 2

The lb_test is the one available in Charm++'s v6.8.1 release. My machine is a 7th-gen Core i7 with 8 GB of RAM, running a Debian-based Linux Mint system.
If you need any more info to reproduce this problem, please let me know.

I wasn't able to find the problem so far, but the code has plenty of comments on its behavior.
Again, I appreciate your help, and if I can do anything else to solve this issue, please, let me know :)

Sincerely,

-- 
Vinicius Marino Calvo Torres de Freitas Computer Science Graduate Student (Aluno de pós-graduação em Ciência da Computação)
Research Assistant at the Embedded Computing Laboratory at UFSC
UFSC - CTC - INE - ECL, Brazil
Email: vinicius.mctf AT grad.ufsc.br or vinicius.mct.freitas AT gmail.com 
Tel: +55 (48) 996 163 803

2018-06-15 14:09 GMT-03:00 Nitin Bhat <nitin.bhat.k AT gmail.com>:
Hi Vinicius, 

Thanks for reaching out. 

Looking at the code, it looks like the handle is being unregistered (or deleted) and then the method LBDB::Migrate is called. 

For that reason, I wanted to check with you about the following: 

1. By any chance, are you explicitly calling LDUnregisterObj or UnregisterObj from your load balancer? 

2. If not, there are two places in the Charm++ runtime from which this gets called (_verifyAckRequestHandler and the CkLocRec destructor). My guess is that, somewhere in the program execution, the Migrate method is called after a call to one of the unregister methods. We can help you further if you send us your load balancer code and the steps to reproduce the issue you are facing.

Thanks,
Nitin Bhat
Software Engineer, 
Charmworks, Inc.

On Jun 15, 2018, at 9:51 AM, Vinicius Freitas <vinicius.mct.freitas AT gmail.com> wrote:

Hello, team,

A brief introduction first. I am Vinicius, and I have been developing load balancing strategies for Charm++ for a few years. Today I found an unusual error (apparently related to LBDBManager.C) which I cannot seem to solve.
I am currently working on a distributed load balancing strategy using the DistBaseLB class, on the v6.8.1 release branch.

After my strategy finishes its work, back in the DistBaseLB workflow, while the migrations are being committed to the LBDB, these lines appeared and confused me:
"[1] LBDB::Migrate: Handle 346 no longer registered, range 0-2188
[1] LBDB::Migrate: Handle 751 no longer registered, range 0-2188
[1] LBDB::Migrate: Handle 1738 no longer registered, range 0-2188
[1] LBDB::Migrate: Handle 1078 no longer registered, range 0-2188"

This happened in a load balancing call of the "lb_test" benchmark. I wasn't able to figure out why it happens. I found that this message is printed on line 318 of LBDBManager.C, after a check on whether a handle exists for a given object.

My question is: in this specific case, in which the handle IS in range but is no longer registered, what does this mean? My load balancer receives the DistBaseLB::stats structure as a const value and does not alter it. I simply read the registered LDCommData and LDObjData structures I receive from the parent class.

Thank you for your time, team,
I appreciate any assistance you can provide,
Sincerely,
-- 
Vinicius Marino Calvo Torres de Freitas Computer Science Graduate Student (Aluno de pós-graduação em Ciência da Computação)
Research Assistant at the Embedded Computing Laboratory at UFSC
UFSC - CTC - INE - ECL, Brazil
Email: vinicius.mctf AT posgrad.ufsc.br or vinicius.mct.freitas AT gmail.com 
Tel: +55 (48) 996 163 803





