illinois-ml-nlp-users - Re: [Illinois-ml-nlp-users] LBJ NER -- possible bug in gazetteers code

illinois-ml-nlp-users AT lists.cs.illinois.edu

Subject: Support for users of CCG software closed 7-27-20

  • From: Lev-Arie Ratinov <arie.ratinov AT gmail.com>
  • To: Tim Dawborn <tim.dawborn AT gmail.com>, "arie.ratinov" <arie.ratinov AT gmail.com>, "cogcomp AT cs.uiuc.edu" <cogcomp AT cs.uiuc.edu>
  • Cc: "illinois-ml-nlp-users AT cs.uiuc.edu" <illinois-ml-nlp-users AT cs.uiuc.edu>
  • Subject: Re: [Illinois-ml-nlp-users] LBJ NER -- possible bug in gazetteers code
  • Date: Mon, 14 Nov 2011 15:25:20 -0500
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/illinois-ml-nlp-users>
  • List-id: Support for users of CCG software <illinois-ml-nlp-users.cs.uiuc.edu>

Hi Tim. Thanks for the pointer. It is indeed a bug, and one that
affects the performance of the system. (When I use only gazetteers
and fix the bug with the null check, I get a 0.2 F1 improvement:
87.3 to 87.5.) I'll fix this bug with the if check you've proposed.
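
For readers following the thread, here is a minimal sketch of what the guarded initialization might look like. It is not the LBJ NER source: the Word class, the hard-coded "Phil Simmons" lookahead, and the guard flag are hypothetical stand-ins that only reproduce the overwrite described in the quoted messages.

```java
import java.util.Vector;

public class GazetteerGuardSketch {
    // Hypothetical stand-in for NEWord; only the fields the thread mentions.
    static class Word {
        final String form;
        Vector<String> gazetteers; // stays null until first touched
        Word next;                 // assumed link to the following word

        Word(String form) { this.form = form; }
    }

    // Simplified annotate: with guard == false it always replaces the vector
    // (the behaviour reported as a bug); with guard == true it allocates the
    // vector only when none exists yet.
    static void annotate(Word w, boolean guard) {
        if (!guard || w.gazetteers == null) {
            w.gazetteers = new Vector<String>();
        }
        // Stand-in for the multi-word lookahead: tag the *future* word of a
        // known bigram, the way the thread describes for "Phil Simmons".
        if (w.form.equals("Phil") && w.next != null && w.next.form.equals("Simmons")) {
            if (w.next.gazetteers == null) {
                w.next.gazetteers = new Vector<String>();
            }
            w.next.gazetteers.add("L-WikiPeople.lst");
        }
    }

    // Annotates "Phil" then "Simmons" and returns the tags left on "Simmons".
    static Vector<String> tagsForSecondWord(boolean guard) {
        Word phil = new Word("Phil");
        Word simmons = new Word("Simmons");
        phil.next = simmons;
        annotate(phil, guard);
        annotate(simmons, guard);
        return simmons.gazetteers;
    }

    public static void main(String[] args) {
        System.out.println("unguarded: " + tagsForSecondWord(false)); // unguarded: []
        System.out.println("guarded:   " + tagsForSecondWord(true));  // guarded:   [L-WikiPeople.lst]
    }
}
```

Note that with the guard in place, annotating the same words a second time would append to the existing vectors, which is the growth concern raised in the quoted message below; a real fix would presumably also need to reset the vectors between passes.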

Thank you very much!

On 11/6/11, Lev-Arie Ratinov <arie.ratinov AT gmail.com> wrote:
> Hi Tim. Thanks for the excellent debugging info.
> I remember adding this line on purpose: otherwise, when annotate is
> called multiple times, the vectors keep growing and you eventually get
> an out-of-memory error unless you are extremely careful. I did try to
> avoid the scenario you describe, but it's possible I have a bug.
>
> I'll look into that and reply soon.
>
> On 11/6/11, Tim Dawborn <tim.dawborn AT gmail.com> wrote:
>> Hi,
>>
>> I'm a PhD student from the University of Sydney. Reading through the
>> source of the LBJ NER multi-word gazetteers, I'm after some
>> clarification about a particular piece of the code. Either it is a bug,
>> or it is the behaviour you were expecting but it was not what I was
>> expecting.
>>
>> Using the source downloaded from your website this morning
>> (LBJNERTagger1.2.zip), the file in question is
>> src/LbjTagger/Gazzetteers.java. Line 75 innocently reads:
>>
>> 73 public static void annotate(NEWord w)
>> 74 {
>> 75     w.gazetteers = new Vector<String>();
>> 76
>>
>> My question is whether this assignment should be guarded by an
>> "if (w.gazetteers == null)" check. The reason is that in the multi-word
>> lookahead (lines 103 - 125 for example), the gazetteers member on the
>> variable temp, which will be a future word, is assigned and populated
>> with values. Those values are currently lost when the future word is
>> later passed to annotate, as line 75 overwrites the potentially existing
>> vector.
>>
>> As an example, consider the bigram "Phil Simmons". This bigram appears in
>> the WikiPeople.lst gazetteer file.
>>
>> tim@tim-macbook:~/Downloads/LbjNerTagger1.11.release$ grep -n '^Phil Simmons$' Data/KnownLists/WikiPeople.lst
>> 144174:Phil Simmons
>>
>> Using the downloaded source, when tagging the word "Phil" with i == 1
>> and loc == 1, the variable temp (pointing to the word "Simmons") gets
>> its gazetteers member assigned to a new vector instance, which is then
>> populated with "L-WikiPeople.lst". This can be confirmed by adding
>> debugging output. Later, when annotate is invoked for the word
>> "Simmons", the vector instance already assigned to the gazetteers member
>> is overwritten and the existing values are lost. This can also be
>> confirmed with debugging output, à la:
>>
>> 73 public static void annotate(NEWord w)
>> 74 {
>> 75     if (w.gazetteers != null) {
>> 76         System.out.print("Overwriting existing gazetteer for word '" + w.form + "':");
>> 77         for (String s : w.gazetteers)
>> 78             System.out.print(" " + s);
>> 79         System.out.println();
>> 80     }
>> 81     w.gazetteers = new Vector<String>();
>>
>> which when executed, produces the output:
>>
>> Overwriting existing gazetteer for word 'Simmons': L-WikiPeople.lst L-WikiPeople.lst(IC)
>>
>> So my question becomes: is this the expected behaviour of your
>> multi-word gazetteer algorithm, or is it a bug in the implementation? It
>> seems to me like a bug.
>>
>> Thanks in advance :)
>>
>>
>> Tim
>> http://sydney.edu.au/it/~tdaw3088/


--
Peace&Love




