charm - Re: [charm] Lustre failure with CkIO

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Tom Quinn <trq AT astro.washington.edu>
  • To: Ronak Buch <rabuch2 AT illinois.edu>
  • Cc: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] Lustre failure with CkIO
  • Date: Fri, 8 Feb 2019 16:41:24 -0800 (PST)
  • Authentication-results: illinois.edu; spf=none smtp.mailfrom=trq AT astro.washington.edu; dmarc=none

Here is some info about Pleiades. It looks like the Lustre API has been incompatibly updated.

Date: Fri, 8 Feb 2019 16:15:22 -0800
From: Mahmoud Hanafi <mahmoud.hanafi AT nasa.gov>
To: Matt F. Cary <matt.cary AT nasa.gov>, mhanafi AT nas.nasa.gov
Cc: trq AT astro.washington.edu
Subject: Re: Could you look at INC000000268615 : Lustre llapi_file_get_stripe() issue

Forgot to mention: the example code will not work with progressive file layouts (PFL), which are used on nbp10-16 and soon will be on nbp2 and nbp8. He would need to use llapi_layout_get_by_path() instead (http://wiki.lustre.org/PFL2_High_Level_Design).
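A minimal sketch of what a layout-based query might look like, for readers following the suggestion above. It assumes the llapi_layout_get_by_path(), llapi_layout_stripe_size_get() and llapi_layout_free() calls from lustreapi.h; the macro HAVE_LUSTRE_LAYOUT_API, the function name stripe_size_by_layout() and the 4 MiB fallback are hypothetical names and values chosen for illustration, not taken from Charm++ or from this thread. It does not address which component of a composite (PFL) layout is reported.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* HAVE_LUSTRE_LAYOUT_API is a hypothetical build-time macro (an assumption,
 * not a Charm++ or Lustre name). Without it, only the fallback is compiled. */
#ifdef HAVE_LUSTRE_LAYOUT_API
#include <lustre/lustreapi.h>
#endif

/* Hypothetical default returned when the layout cannot be read. */
#define FALLBACK_STRIPE_SIZE ((uint64_t)4 * 1024 * 1024)

/* Return the stripe size of `path`, or FALLBACK_STRIPE_SIZE on any failure. */
static uint64_t stripe_size_by_layout(const char *path) {
    assert(path != NULL);
#ifdef HAVE_LUSTRE_LAYOUT_API
    uint64_t size = 0;
    /* Read the layout of an existing file; flags = 0 requests no special handling. */
    struct llapi_layout *layout = llapi_layout_get_by_path(path, 0);
    if (layout == NULL)
        return FALLBACK_STRIPE_SIZE;  /* errno describes the failure */
    int rc = llapi_layout_stripe_size_get(layout, &size);
    llapi_layout_free(layout);
    return (rc == 0 && size > 0) ? size : FALLBACK_STRIPE_SIZE;
#else
    (void)path;  /* no Lustre layout API available at build time */
    return FALLBACK_STRIPE_SIZE;
#endif
}
```

Built without the hypothetical macro, stripe_size_by_layout(".") simply returns the fallback, so the same source compiles on machines with and without the newer Lustre headers.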


Tom Quinn Astronomy, University of Washington
Internet: trq AT astro.washington.edu
Phone: 206-685-9009

On Wed, 6 Feb 2019, Tom Quinn wrote:

I compiled on a compute node to avoid differences. Further investigation shows the following issues:
1) If you run on a non-Lustre filesystem (say, because you are running a test program in the source directory), then llapi_file_get_stripe() fails with errno ENOTTY. I propose that this should be caught and that CkGetFileStripeSize() should return the value used for non-Lustre systems.

2) If the file doesn't exist, the Pleiades system is returning errno 61 "Connection refused" instead of ENOENT. I'm not sure what to do here.

3) Even worse: llapi_file_get_stripe() is returning "Connection refused" if called on an existing directory.
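The fallback proposed in item 1 could be sketched roughly as follows. This is not the Charm++ implementation: stripe_query_fn, stripe_size_or_default(), the two stub functions and the 4 MiB default are all hypothetical, introduced so the example compiles without Lustre headers. The sketch also makes one design assumption beyond the proposal: it falls back on every failure, not only ENOTTY, and leaves errno untouched so the caller can still distinguish the cases in items 2 and 3.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical default; the value Charm++ uses on non-Lustre systems is not
 * stated in this thread. */
#define DEFAULT_STRIPE_SIZE ((size_t)4 * 1024 * 1024)

/* Hypothetical query signature: returns 0 and fills *size on success,
 * nonzero with errno set on failure. */
typedef int (*stripe_query_fn)(const char *path, size_t *size);

/* Ask `query` for the stripe size; fall back to the default on any failure. */
static size_t stripe_size_or_default(const char *path, stripe_query_fn query) {
    size_t size = 0;
    assert(path != NULL && query != NULL);
    errno = 0;  /* make sure a stale errno is not mistaken for this call's */
    if (query(path, &size) == 0 && size > 0)
        return size;
    /* ENOTTY means "not a Lustre filesystem" (item 1). Other failures
     * (items 2 and 3) land here as well; errno is left as the query set it. */
    return DEFAULT_STRIPE_SIZE;
}

/* Stub standing in for a stripe query made on a non-Lustre filesystem. */
static int query_not_lustre(const char *path, size_t *size) {
    (void)path;
    (void)size;
    errno = ENOTTY;
    return -1;
}

/* Stub standing in for a successful query reporting a 1 MiB stripe. */
static int query_one_mib(const char *path, size_t *size) {
    (void)path;
    *size = (size_t)1 << 20;
    return 0;
}
```

With the stubs, stripe_size_or_default(".", query_not_lustre) yields the default while errno still reads ENOTTY, and stripe_size_or_default("f", query_one_mib) yields the reported 1 MiB.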

Here is the program (from configure.ac) that I'm using to test this out:

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <lustre/lustreapi.h>
#include <lustre/lustre_user.h>

static inline int maxInt(int a, int b) {
    return a > b ? a : b;
}

/* Allocate a buffer large enough for either the v1 or the v3 striping layout. */
static void *alloc_lum(void) {
    int v1, v3;
    v1 = sizeof(struct lov_user_md_v1) +
         LOV_MAX_STRIPE_COUNT * sizeof(struct lov_user_ost_data_v1);
    v3 = sizeof(struct lov_user_md_v3) +
         LOV_MAX_STRIPE_COUNT * sizeof(struct lov_user_ost_data_v1);

    return malloc(maxInt(v1, v3));
}

int main(void) {
    llapi_printf(LLAPI_MSG_NORMAL, "Lustre FS is available\n");
    {
        struct lov_user_md *lump = alloc_lum();
        if (lump == NULL) {
            perror("malloc");
            return 1;
        }
        errno = 0; /* so the value printed below comes from this call only */
        int rc = llapi_file_get_stripe(".", lump);
        printf("getstripe rc: %d, errno %d\n", rc, errno);
        free(lump);
    }

    return 0;
}
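Since the test program prints only the numeric errno, a small helper that also prints the symbolic name and the C library's message can make the output easier to interpret. The mapping from number to name is platform dependent: on Linux, 25 is ENOTTY, 61 is ENODATA ("No data available") and ECONNREFUSED is 111, whereas on BSD-derived systems 61 is ECONNREFUSED. The function names errno_name() and print_errno() below are hypothetical, and the sketch covers only the codes that come up in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Map the errno values discussed in this thread to their symbolic names. */
static const char *errno_name(int err) {
    switch (err) {
    case ENOTTY:       return "ENOTTY";
    case ENOENT:       return "ENOENT";
    case ECONNREFUSED: return "ECONNREFUSED";
#ifdef ENODATA  /* not defined on every platform */
    case ENODATA:      return "ENODATA";
#endif
    default:           return "(other)";
    }
}

/* Print the number, symbolic name, and library message for one errno value. */
static void print_errno(int err) {
    assert(err > 0);  /* POSIX errno values are positive */
    printf("errno %d = %s: %s\n", err, errno_name(err), strerror(err));
}
```

Calling print_errno(errno) right after the failing llapi_file_get_stripe() call, on the machine where the failure occurs, reports the name as that platform defines it.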

Tom Quinn Astronomy, University of Washington
Internet: trq AT astro.washington.edu
Phone: 206-685-9009

On Mon, 4 Feb 2019, Ronak Buch wrote:

Interesting, I've never seen that code path be a problem before. It should
be doing the checks to ensure that the correct Lustre APIs exist before
enabling it. Do you know if the configuration of the login and compute
nodes is different with regard to Lustre?

The most foolproof way to turn it off is to just change line 4 of
fs_parameters.c from #if CMK_HAS_LUSTREFS to #if 0.

On Mon, Feb 4, 2019 at 10:47 PM Tom Quinn <trq AT astro.washington.edu> wrote:
I'm getting "errno 25" out of the llapi_file_get_stripe() in
io/fs_parameters.c:CkGetFileStripeSize().  ("not a typewriter"?)

This is on the "Pleiades" machine at NASA Ames.

1) Any ideas about how to fix this?
2) How do I avoid using the Lustre code in Charm, and therefore work
around this problem?

Tom Quinn       Astronomy, University of Washington
Internet:       trq AT astro.washington.edu
Phone:          206-685-9009





