
illinois-ml-nlp-users - Re: [Illinois-ml-nlp-users] LBJ NER -- possible bug in gazetteers code

illinois-ml-nlp-users AT lists.cs.illinois.edu

Subject: Support for users of CCG software closed 7-27-20

  • From: Lev-Arie Ratinov <arie.ratinov AT gmail.com>
  • To: Tim Dawborn <tim.dawborn AT gmail.com>, "arie.ratinov" <arie.ratinov AT gmail.com>
  • Cc: "illinois-ml-nlp-users AT cs.uiuc.edu" <illinois-ml-nlp-users AT cs.uiuc.edu>
  • Subject: Re: [Illinois-ml-nlp-users] LBJ NER -- possible bug in gazetteers code
  • Date: Sun, 6 Nov 2011 10:36:48 -0600
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/illinois-ml-nlp-users>
  • List-id: Support for users of CCG software <illinois-ml-nlp-users.cs.uiuc.edu>

Hi Tim, thanks for the excellent debugging info. I'll look into that.
I remember adding this line on purpose: otherwise, when you call
annotate multiple times, the vectors grow and you'll get an
out-of-memory exception unless you're extremely careful. I did try
to avoid the scenario you describe, but it's possible I have a bug.

I'll look into that and reply soon.
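
A minimal sketch of one way the two concerns could be reconciled, not the actual LBJ NER code: clear the tags once per sentence in a separate pass, and let annotate allocate the list only when it is missing, so tags written by the look-ahead survive while repeated calls still do not grow the lists. The names Word, tagsOf, reset and annotateSentence, the one-entry bigram gazetteer, and the "B-" prefix for the first word are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GazetteerResetSketch {
    /** Hypothetical stand-in for NEWord: a surface form plus its gazetteer tags. */
    static class Word {
        final String form;
        List<String> gazetteers; // null until a tag is first requested
        Word(String form) { this.form = form; }
    }

    // Hypothetical one-entry bigram gazetteer ("Phil Simmons" in WikiPeople.lst).
    static final String FIRST = "Phil";
    static final String SECOND = "Simmons";
    static final String LIST_NAME = "WikiPeople.lst";

    /** Allocate the tag list only if it is missing, so earlier look-ahead tags are kept. */
    static List<String> tagsOf(Word w) {
        if (w.gazetteers == null) {
            w.gazetteers = new ArrayList<>();
        }
        return w.gazetteers;
    }

    /** Drop all tags once, before any word of the sentence is annotated. */
    static void reset(List<Word> sentence) {
        for (Word w : sentence) {
            w.gazetteers = null;
        }
    }

    /** Annotate word i; a bigram match also tags the following word. */
    static void annotate(List<Word> sentence, int i) {
        Word w = sentence.get(i);
        tagsOf(w); // guarantees a non-null list without overwriting existing tags
        if (i + 1 < sentence.size()
                && w.form.equals(FIRST)
                && sentence.get(i + 1).form.equals(SECOND)) {
            tagsOf(w).add("B-" + LIST_NAME);                   // prefix is an assumption
            tagsOf(sentence.get(i + 1)).add("L-" + LIST_NAME); // look-ahead write
        }
    }

    /** Reset first, then annotate every word in order. */
    static void annotateSentence(List<Word> sentence) {
        reset(sentence);
        for (int i = 0; i < sentence.size(); i++) {
            annotate(sentence, i);
        }
    }

    public static void main(String[] args) {
        List<Word> sentence = Arrays.asList(new Word("Phil"), new Word("Simmons"));
        annotateSentence(sentence);
        annotateSentence(sentence); // a second call must not grow the lists
        for (Word w : sentence) {
            System.out.println(w.form + " " + w.gazetteers);
        }
    }
}
```

In this sketch "Simmons" still carries its L- tag after its own annotate call, and calling annotateSentence twice leaves exactly one tag per word, because the reset happens before the look-ahead rather than inside annotate.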

On 11/6/11, Tim Dawborn <tim.dawborn AT gmail.com> wrote:
> Hi,
>
> I'm a PhD student from the University of Sydney. Reading through the source
> of the LBJ NER multi-word gazetteers, I'm after some clarification about a
> particular piece of the code. Either it is a bug, or it is the behaviour you
> were expecting but it was not what I was expecting.
>
> Using the source downloaded from your website this morning
> (LBJNERTagger1.2.zip), the file in question is
> src/LbjTagger/Gazzetteers.java. Line 75 innocently reads:
>
> 73 public static void annotate(NEWord w)
> 74 {
> 75 w.gazetteers=new Vector<String>();
> 76
>
> My question here is whether or not this should have an "if (w.gazetteers ==
> null)" check beforehand. The reason is that in the multi-word
> lookahead (lines 103 - 125 for example), the gazetteers member on the
> variable temp, which will be a future word, is assigned and populated with
> values. These values currently will be lost when the future word is then
> passed to annotate, as line 75 blasts over the potentially existing vector.
>
> As an example, consider the bigram "Phil Simmons". This bigram appears in
> the WikiPeople.lst gazetteer file.
>
> tim@tim-macbook:~/Downloads/LbjNerTagger1.11.release$ grep -n '^Phil Simmons$' Data/KnownLists/WikiPeople.lst
> 144174:Phil Simmons
>
> Using the source downloaded, when tagging the word "Phil" when i == 1 and
> loc == 1, the variable temp (pointing to the word "Simmons") gets its
> gazetteers member assigned to a new vector instance, and then populated with
> "L-WikiPeople.lst". This can be confirmed through the addition of debugging
> information. Later, when annotate gets invoked for the word "Simmons", the
> vector instance which is already assigned to the gazetteers member gets
> overwritten and the existing value gets lost. This also can be confirmed
> with debugging information, à la:
>
> 73 public static void annotate(NEWord w)
> 74 {
> 75     if (w.gazetteers != null) {
> 76         System.out.print("Overwriting existing gazetteer for word '" + w.form + "':");
> 77         for (String s : w.gazetteers)
> 78             System.out.print(" " + s);
> 79         System.out.println();
> 80     }
> 81     w.gazetteers=new Vector<String>();
>
> which when executed, produces the output:
>
> Overwriting existing gazetteer for word 'Simmons': L-WikiPeople.lst L-WikiPeople.lst(IC)
>
> So my question becomes: is this the expected behaviour of your
> multi-word gazetteer algorithm, or is it a bug in the implementation? It
> seems to me like a bug.
>
> Thanks in advance :)
>
>
> Tim
> http://sydney.edu.au/it/~tdaw3088/
>
>


--
Peace&Love




