illinois-ml-nlp-users - [Illinois-ml-nlp-users] LBJ NER -- possible bug in gazetteers code

illinois-ml-nlp-users AT lists.cs.illinois.edu

Subject: Support for users of CCG software closed 7-27-20

  • From: Tim Dawborn <tim.dawborn AT gmail.com>
  • To: illinois-ml-nlp-users AT cs.uiuc.edu
  • Subject: [Illinois-ml-nlp-users] LBJ NER -- possible bug in gazetteers code
  • Date: Sun, 6 Nov 2011 21:34:00 +1100
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/illinois-ml-nlp-users>
  • List-id: Support for users of CCG software <illinois-ml-nlp-users.cs.uiuc.edu>

Hi,

I'm a PhD student at the University of Sydney. While reading through the source of the LBJ NER multi-word gazetteers, I came across a piece of code I'd like some clarification on. Either it is a bug, or it is the intended behaviour and simply not what I was expecting.

Using the source downloaded from your website this morning (LBJNERTagger1.2.zip), the file in question is src/LbjTagger/Gazzetteers.java. Line 75 innocently reads:

 73   public static void annotate(NEWord w)
 74   {
 75     w.gazetteers=new Vector<String>();
 76     

My question is whether this assignment should be guarded by an "if (w.gazetteers == null)" check. The reason is that in the multi-word lookahead (lines 103 - 125, for example), the gazetteers member of the variable temp, which refers to a future word, is assigned and populated with values. Those values are currently lost when the future word is itself passed to annotate, because line 75 overwrites the potentially existing vector.
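To make the suggestion concrete, here is a minimal sketch of the guarded initialisation. It is not the tagger's code: Word is a hypothetical stand-in for NEWord with only the two fields used here, annotateGuarded is an illustrative name, and the actual single-word and multi-word lookups are left out.

```java
import java.util.Vector;

// Hypothetical stand-in for NEWord: only the fields this sketch needs.
class Word {
    String form;
    Vector<String> gazetteers; // null until some annotation step allocates it

    Word(String form) { this.form = form; }
}

class GuardedAnnotateSketch {
    // Sketch of the guard: allocate the vector only if an earlier
    // lookahead has not already created (and possibly filled) it.
    static void annotateGuarded(Word w) {
        if (w.gazetteers == null) {
            w.gazetteers = new Vector<String>();
        }
        // ... the single-word and multi-word lookups would follow here ...
    }

    public static void main(String[] args) {
        Word simmons = new Word("Simmons");
        // Simulate the tag that a lookahead from "Phil" would have added.
        simmons.gazetteers = new Vector<String>();
        simmons.gazetteers.add("L-WikiPeople.lst");

        annotateGuarded(simmons);
        System.out.println(simmons.gazetteers); // prints [L-WikiPeople.lst]
    }
}
```

With the guard in place, a word that was already tagged by a lookahead keeps its entries, while a word that has not been touched yet still gets a fresh, empty vector.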

As an example, consider the bigram "Phil Simmons". This bigram appears in the WikiPeople.lst gazetteer file.

tim@tim-macbook:~/Downloads/LbjNerTagger1.11.release$ grep -n '^Phil Simmons$' Data/KnownLists/WikiPeople.lst 
144174:Phil Simmons

With the downloaded source, while tagging the word "Phil", when i == 1 and loc == 1, the variable temp (pointing to the word "Simmons") has its gazetteers member assigned a new vector instance, which is then populated with "L-WikiPeople.lst". This can be confirmed by adding debugging output. Later, when annotate is invoked for the word "Simmons", the vector already assigned to its gazetteers member is overwritten and the existing values are lost. This can also be confirmed with debugging output, for example:

 73   public static void annotate(NEWord w)
 74   { 
 75     if (w.gazetteers != null) {
 76       System.out.print("Overwriting existing gazetteer for word '" + w.form + "':");
 77       for (String s : w.gazetteers)
 78         System.out.print(" " + s);
 79       System.out.println();
 80     }
 81     w.gazetteers=new Vector<String>();

which when executed, produces the output:

Overwriting existing gazetteer for word 'Simmons': L-WikiPeople.lst L-WikiPeople.lst(IC)
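The same sequence of events can be reproduced in isolation. The following is a simplified model, not the tagger's implementation: SketchWord is a hypothetical stand-in for NEWord, the gazetteer is a one-entry set, the lookahead covers only a single following word, the "B-" prefix on the first word is an assumption, and the "(IC)" variant is omitted.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.Vector;

// Hypothetical stand-in for NEWord, reduced to what the example needs.
class SketchWord {
    final String form;
    SketchWord next;
    Vector<String> gazetteers;

    SketchWord(String form) { this.form = form; }
}

class OverwriteRepro {
    // Toy gazetteer holding the single bigram from the example.
    static final Set<String> WIKI_PEOPLE = new HashSet<String>();
    static { WIKI_PEOPLE.add("Phil Simmons"); }

    // Simplified model of annotate: an unconditional reset, then a
    // one-word lookahead that tags the current and the following word.
    static void annotate(SketchWord w) {
        w.gazetteers = new Vector<String>(); // the assignment in question
        SketchWord temp = w.next;
        if (temp != null && WIKI_PEOPLE.contains(w.form + " " + temp.form)) {
            w.gazetteers.add("B-WikiPeople.lst"); // assumed prefix
            if (temp.gazetteers == null) {
                temp.gazetteers = new Vector<String>();
            }
            temp.gazetteers.add("L-WikiPeople.lst");
        }
    }

    public static void main(String[] args) {
        SketchWord phil = new SketchWord("Phil");
        SketchWord simmons = new SketchWord("Simmons");
        phil.next = simmons;

        annotate(phil);
        System.out.println("after Phil:    " + simmons.gazetteers); // [L-WikiPeople.lst]
        annotate(simmons);
        System.out.println("after Simmons: " + simmons.gazetteers); // []
    }
}
```

In this model the tag placed on "Simmons" by the lookahead from "Phil" is gone once annotate reaches "Simmons" itself, which matches the debugging output above.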

So my question is: is this the expected behaviour of your multi-word gazetteer algorithm, or is it a bug in the implementation? It looks like a bug to me.

Thanks in advance :)


Tim
http://sydney.edu.au/it/~tdaw3088/


