illinois-ml-nlp-users - Re: [Illinois-ml-nlp-users] Running the LBJ NER Tagger

illinois-ml-nlp-users AT lists.cs.illinois.edu

Subject: Support for users of CCG software closed 7-27-20

  • From: Jeff Dalton <jdalton AT cs.umass.edu>
  • To: Lev-Arie Ratinov <ratinov2 AT uiuc.edu>
  • Cc: illinois-ml-nlp-users AT cs.uiuc.edu
  • Subject: Re: [Illinois-ml-nlp-users] Running the LBJ NER Tagger
  • Date: Fri, 18 Mar 2011 16:30:40 -0400
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/illinois-ml-nlp-users>
  • List-id: Support for users of CCG software <illinois-ml-nlp-users.cs.uiuc.edu>

Thanks for getting back to me.  ... My responses are inline below.  With a bit of hacking I managed to get the NER classifier working.  Here are the results on the eng.testb set from CoNLL 2003.  The evaluation is done using conlleval.pl and reports phrase-level measures:

processed 46435 tokens with 5645 phrases...

Config/baselineFeatures.config
accuracy:  93.91%; precision:  75.78%; recall:  67.12%; FB1:  71.19

Config/allLayer1.config
accuracy:  97.28%; precision:  88.34%; recall:  85.79%; FB1:  87.05

Config/AllFeatures.config
accuracy:  97.09%; precision:  88.39%; recall:  83.77%; FB1:  86.02

These numbers differ from the ones reported in the paper.  In particular, the performance of the baseline and AllFeatures models is very different.  Is there anything else that I could be missing that needs to be done?  It would be good to understand what could be going wrong.
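
As a sanity check on the figures above: the FB1 column is the harmonic mean of precision and recall. The following is a minimal Java sketch that recomputes it from the reported precision and recall. The class name F1Check and its method are made up for illustration and are not part of the tagger or of conlleval.pl.

```java
final class F1Check {
    /** Harmonic mean of precision and recall, both given in percent. */
    static double f1(double precision, double recall) {
        if (precision + recall == 0.0) {
            return 0.0; // avoid dividing by zero when nothing was predicted correctly
        }
        return 2.0 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Precision and recall copied from the three runs reported above.
        String[] configs = {
            "Config/baselineFeatures.config",
            "Config/allLayer1.config",
            "Config/AllFeatures.config"
        };
        double[][] pr = { {75.78, 67.12}, {88.34, 85.79}, {88.39, 83.77} };
        for (int i = 0; i < configs.length; i++) {
            System.out.printf("%-32s FB1: %.2f%n", configs[i], f1(pr[i][0], pr[i][1]));
        }
    }
}
```

The recomputed values agree with the reported FB1 figures to two decimal places, so the three rows are at least internally consistent.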

I would also be very interested in running the tagger over the web page data you report on.  Is it possible to make that dataset available?

Thanks again for the help.

- Jeff

On Thu, Mar 10, 2011 at 2:13 PM, Lev-Arie Ratinov <ratinov2 AT uiuc.edu> wrote:
Hi Jeff.

I've seen this error :
"ERROR: Can't locate NETaggerLevel1.lc in the class path."
before. It's an elusive error. The file NETaggerLevel1.java
is generated automatically, so I believe that understanding
it is the wrong way to go.

Are you using Unix/Linux or Windows? The error you're reporting
is typical for Windows systems. One of the tricks there is that,
in Windows, your paths should be absolute, e.g.:

D:\temp\TFGTagger\dist\TFGTagger.jar;D:\temp\TFGTagger\lib\LBJ2.jar;D:\temp\TFGTagger\lib\LBJ2Library.jar
EndToEndSystemTFG.TFGTagger annotate
D:\temp\TFGTagger\Data\SampleEmails\SampleRawMails
D:\temp\TFGTagger\Data\SampleEmails\SampleTaggedMails
D:\temp\TFGTagger\Config\TFG_ForClient_X.config

I am using 64-bit Ubuntu Linux 9.10 with a 64-bit Sun JDK 1.6.xxx.

The issue is that the classifier classes have static initializer blocks that set the lcFilePath member variable, such as the one below:

  static
  {
    lcFilePath = NETypeTagger.class.getResource("NETypeTagger.lc");

    if (lcFilePath == null)
    {
      System.err.println("ERROR: Can't locate NETypeTagger.lc in the class path.");
      System.exit(1);
    }
  }

If the lc file isn't there, then the classifier exits.  The lc file isn't in the distribution, so I don't see how the file in the path could ever be present...

To get around this problem I removed the static initializers and set the lcFilePath in the constructor to the saved classifier locations.  This allowed them to read the saved models included in the distribution.
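
The workaround described above could be sketched roughly as follows. This is an illustration only, not the code from the distribution: the class ModelLocator, its locate method, and the fallback directory argument are hypothetical names. It tries the class path first, as the generated initializer does, then falls back to a directory of saved models, and throws an exception rather than calling System.exit.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

final class ModelLocator {
    /**
     * Returns the location of a saved model file: first as a class path
     * resource next to 'owner', then as a plain file in 'fallbackDir'.
     */
    static URL locate(Class<?> owner, String resourceName, File fallbackDir) {
        // Same lookup the generated static initializer performs.
        URL url = owner.getResource(resourceName);
        if (url != null) {
            return url;
        }
        // Hypothetical fallback: a directory holding the saved models.
        File candidate = new File(fallbackDir, resourceName);
        if (candidate.isFile()) {
            try {
                return candidate.toURI().toURL();
            } catch (MalformedURLException e) {
                throw new IllegalStateException("Bad model path: " + candidate, e);
            }
        }
        // Fail with an exception so the caller can decide what to do.
        throw new IllegalStateException(
            "Can't locate " + resourceName + " on the class path or in " + fallbackDir);
    }
}
```

A constructor could then assign something like lcFilePath = ModelLocator.locate(NETypeTagger.class, "NETypeTagger.lc", modelDir), where modelDir is whatever directory holds the saved models in your copy of the distribution.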

 
Also, the column format I'm using is a little different from the CoNLL03
annotation format. Below is an example; note that there is shallow
parse and POS info there, but I don't use it, so you can replace these
columns with dummy values. Sorry, I don't have a script for that. The
column format matters because it marks sentence boundaries. I also
support a "brackets format", but then you'll rely on my own sentence
splitting, and you won't be able to reproduce the results. Here is the
sample data. Please let me know if it solves your problems:


O       0       0       O       -X-     -DOCSTART-      x       x       0

O       0       0       I-NP    NNP     CRICKET x       x       0
O       0       1       O       :       -       x       x       0
B-ORG   0       2       I-NP    NNP     LEICESTERSHIRE  x       x       0
O       0       3       I-NP    NNP     TAKE    x       x       0
O       0       4       I-PP    IN      OVER    x       x       0
O       0       5       I-NP    NNP     AT      x       x       0
O       0       6       I-NP    NNP     TOP     x       x       0
O       0       7       I-NP    NNP     AFTER   x       x       0
O       0       8       I-NP    NNP     INNINGS x       x       0
O       0       9       I-NP    NN      VICTORY x       x       0
O       0       10      O       .       .       x       x       0

B-LOC   0       0       I-NP    NNP     LONDON  x       x       0
O       0       1       I-NP    CD      1996-08-30      x       x       0

B-MISC  0       0       I-NP    NNP     West    x       x       0
I-MISC  0       1       I-NP    NNP     Indian  x       x       0
O       0       2       I-NP    NN      all-rounder     x       x       0
B-PER   0       3       I-NP    NNP     Phil    x       x       0
I-PER   0       4       I-NP    NNP     Simmons x       x       0
O       0       5       I-VP    VBD     took    x       x       0
O       0       6       I-NP    CD      four    x       x       0
O       0       7       I-PP    IN      for     x       x       0
O       0       8       I-NP    CD      38      x       x       0
O       0       9       I-PP    IN      on      x       x       0
O       0       10      I-NP    NNP     Friday  x       x       0
O       0       11      I-PP    IN      as      x       x       0
B-ORG   0       12      I-NP    NNP     Leicestershire  x       x       0
O       0       13      I-VP    VBD     beat    x       x       0
B-ORG   0       14      I-NP    NNP     Somerset        x       x       0

Thanks.  This helps.  I modified the reader code to read the CoNLL format and pick out the necessary column values.
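
For anyone who would rather convert the data than change the reader, here is a rough Java sketch of the column reordering. It assumes the input lines are in the usual four-column CoNLL 2003 layout (word, POS, chunk, NE tag) and writes the nine-column layout shown in the sample above, copying the constant columns (0, x, x, 0) as dummy values because their meaning is not explained in this thread. The class name ConllToColumns is made up for illustration. The sketch does not change the tagging scheme, so if your copy of the data does not already use B- tags at entity starts as in the sample, the tags would need a separate conversion.

```java
import java.util.ArrayList;
import java.util.List;

final class ConllToColumns {
    /** Converts "word POS chunk tag" lines; blank lines mark sentence boundaries. */
    static List<String> convert(List<String> conllLines) {
        List<String> out = new ArrayList<>();
        int index = 0; // token position within the current sentence
        for (String line : conllLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                out.add("");  // keep the sentence boundary
                index = 0;    // restart token numbering
                continue;
            }
            String[] f = trimmed.split("\\s+");
            if (f.length < 4) {
                throw new IllegalArgumentException("Expected 4 columns: " + line);
            }
            String word = f[0], pos = f[1], chunk = f[2], tag = f[3];
            // Column order taken from the sample: tag, 0, index, chunk, POS, word, x, x, 0
            out.add(String.join("\t",
                tag, "0", Integer.toString(index), chunk, pos, word, "x", "x", "0"));
            index++;
        }
        return out;
    }
}
```

Because the shallow parse and POS columns are not used by the tagger, the chunk and pos values could also be replaced by dummy strings.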
 




On Thu, Mar 10, 2011 at 5:44 AM, Jeff Dalton <jdalton AT cs.umass.edu> wrote:
> I'm a PhD student at UMass Amherst in the CIIR.  I am trying to run the UIUC
> NER tagger for a project I am working on.  I downloaded the distribution
> from the website.  However, when I try to run it, I get the error:  "ERROR:
> Can't locate NETaggerLevel1.lc in the class path."  I cannot locate the
> specified file in the distribution.  It looks like the output of the saved
> classifier instance.   From the code in NETaggerLevel1.java, it is not clear
> what the appropriate setting for lcFilePath is, or how I should create it.
>  I assume it is created as part of the training process. I tried to run the
> training command, but it fails in the same location. Could you perhaps shed
> some light on this mystery?
> Also, the parser appears to be loading data in Reuters format.  I have the
> conll data and the data format appears to differ.  Are there scripts to
> convert between the formats?  Perhaps I am missing a bit of documentation on
> training.  I'd like to try and reproduce the conll results.
> I would appreciate any help you could give.
> Cheers,
> - Jeff



--
Peace&Love



