
illinois-ml-nlp-users - Re: [Illinois-ml-nlp-users] Running the LBJ NER Tagger

illinois-ml-nlp-users AT lists.cs.illinois.edu

Subject: Support for users of CCG software closed 7-27-20

  • From: Lev-Arie Ratinov <ratinov2 AT uiuc.edu>
  • To: Jeff Dalton <jdalton AT cs.umass.edu>
  • Cc: illinois-ml-nlp-users AT cs.uiuc.edu
  • Subject: Re: [Illinois-ml-nlp-users] Running the LBJ NER Tagger
  • Date: Mon, 21 Mar 2011 00:58:29 -0500
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/illinois-ml-nlp-users>
  • List-id: Support for users of CCG software <illinois-ml-nlp-users.cs.uiuc.edu>

Hm.

Your usage of my code is really ingenious. I still recommend that you
read a column format with the -c option, but what you're doing looks
legit.

Are you invoking
public Chunking chunk(CharSequence cSeq)
on a sentence by sentence basis?

Also, there is something that worries me a great deal: the 71.19 F1 of
the baseline. People who implement perceptron-based NER taggers all
report >80 F1 baselines. Either my model is wrong, or something else is
wrong.

I'll need more time to check this. In the meantime, can you
answer my question:
Are you invoking
public Chunking chunk(CharSequence cSeq)
on a sentence by sentence basis?
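
To make the question concrete, here is a minimal sketch of what "sentence by sentence" could look like in a wrapper. It is an illustration only, not the code from the attached wrapper: the class SentenceBatcher, its method names, and the sample lines are hypothetical, and it assumes the input is in the four-column CoNLL03 layout with a blank line between sentences. Joining tokens with single spaces is also an assumption about how the text is rebuilt before tagging.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: groups column-format lines into sentences so that a
// tagger can be invoked once per sentence instead of once per document.
final class SentenceBatcher {

    /** Groups lines into sentences; one or more blank lines end a sentence. */
    static List<List<String>> splitSentences(List<String> lines) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(line);
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    /** Joins one column of every line in a sentence with single spaces. */
    static String sentenceText(List<String> sentence, int tokenColumn) {
        StringBuilder sb = new StringBuilder();
        for (String line : sentence) {
            String[] cols = line.trim().split("\\s+");
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(cols[tokenColumn]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Illustrative lines in the CoNLL03 layout: token POS chunk NE-tag.
        List<String> lines = Arrays.asList(
            "LONDON NNP I-NP I-LOC",
            "1996-08-30 CD I-NP O",
            "",
            "Phil NNP I-NP I-PER",
            "Simmons NNP I-NP I-PER",
            "took VBD I-VP O");
        for (List<String> sentence : splitSentences(lines)) {
            // A real wrapper would call chunk(text) here, once per sentence.
            System.out.println(sentenceText(sentence, 0));
        }
    }
}
```

If the wrapper instead passes a whole document to chunk(...) in one call, the sentence boundaries would come from the tagger's own splitter rather than from the data.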


On Fri, Mar 18, 2011 at 4:52 PM, Jeff Dalton
<jdalton AT cs.umass.edu>
wrote:
> Thanks for the quick reply!  Attached is a zip file containing the result
> output for the eng.testb dataset with three of the models included in the
> distribution (baselineFeatures, allLayer1, and allFeatures).  Each file in
> the zip has lines that contain: token, corpus tag, and predicted value.  The
> files are in a format that can be processed by conlleval.pl.
> I am attaching the wrapper code that I wrote around LBJ.  It has a main and
> should be runnable.  The only external dependency is that it uses the
> LingPipe 4.0 jar for some Chunk interfaces, which I found useful as a common
> interface for running multiple taggers across a sequence.
> Here is also the output from running the tagger initialization to show the
> differences in the features being used.  This should show that the correct
> configurations are being loaded.
> Reading input: ./data/conll03/eng.testb to output test file:./results/test3 with tagger:lbj-baseline
> Adding feature: Forms
> Adding feature: Capitalization
> Adding feature: WordTypeInformation
> Adding feature: Affixes
> Adding feature: PreviousTag1
> Adding feature: PreviousTag2
> 95262 words added
> The console output from using the AllLayer1 configuration:
> Adding feature: GazetteersFeatures
> Adding feature: Forms
> Adding feature: Capitalization
> Adding feature: WordTypeInformation
> Adding feature: Affixes
> Adding feature: PreviousTag1
> Adding feature: PreviousTag2
> Adding feature: BrownClusterPaths
> Adding feature: NEShapeTaggerFeatures
> Adding feature: aggregateContext
> Adding feature: aggregateGazetteerMatches
> Adding feature: prevTagsForContext
> 95262 words added
> loading dazzetteers....
> loading gazzetteer:....Data/KnownLists/WikiArtWork.lst
> ...
> loading gazzetteer:....Data/KnownLists/known_place.lst
> found 30 gazetteers
> loading contextless shape classifier
> loading shape classifier
> Done loading shape classifier
> Done- loading contextless shape classifier
> And lastly, the All features console:
> Adding feature: GazetteersFeatures
> Adding feature: Forms
> Adding feature: Capitalization
> Adding feature: WordTypeInformation
> Adding feature: Affixes
> Adding feature: PreviousTag1
> Adding feature: PreviousTag2
> Adding feature: BrownClusterPaths
> Adding feature: NEShapeTaggerFeatures
> Adding feature: aggregateContext
> Adding feature: aggregateGazetteerMatches
> Adding feature: prevTagsForContext
> Adding feature: PatternFeatures
> Adding feature: PredictionsLevel1
> 95262 words added
> loading dazzetteers....
> ...
> - Jeff
> On Fri, Mar 18, 2011 at 5:12 PM, Lev-Arie Ratinov
> <ratinov2 AT uiuc.edu>
> wrote:
>>
>> Hi Jeff. The first step would be trying to double-check that the
>> output of your system as you're running it is identical to the
>> intended output. Can you send me the output of my system?
>>
>> The web data is attached. I may have re-annotated the data, and I've
>> also discovered a bug in my evaluation on the Web portion. So you'll
>> see a different result on the Web data. However, the result on the
>> CoNLL data should be around 90F1 phrase-level.
>>
>>
>>
>> On Fri, Mar 18, 2011 at 3:30 PM, Jeff Dalton
>> <jdalton AT cs.umass.edu>
>> wrote:
>> > Thanks for getting back to me.  ... My responses are inline below.  With a
>> > bit of hacking I managed to get the ner classifier working.  Here are the
>> > results of the output for the eng.testb set from conll 2003.  The evaluation
>> > is done using conlleval.pl and reports phrase-level measures:
>> > processed 46435 tokens with 5645 phrases...
>> > Config/baselineFeatures.config
>> > accuracy:  93.91%; precision:  75.78%; recall:  67.12%; FB1:  71.19
>> > Config/allLayer1.config
>> > accuracy:  97.28%; precision:  88.34%; recall:  85.79%; FB1:  87.05
>> > Config/AllFeatures.config
>> > accuracy:  97.09%; precision:  88.39%; recall:  83.77%; FB1:  86.02
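
As a quick sanity check on the figures above: the FB1 column is the harmonic mean of the reported precision and recall, so each row can be verified by hand. The worked line below uses the baseline row. conlleval.pl computes from raw phrase counts, so recomputing from the rounded percentages can differ in the last digit in general.

```latex
\[
F_{\beta=1} = \frac{2PR}{P+R},
\qquad
\frac{2 \times 75.78 \times 67.12}{75.78 + 67.12} \approx 71.19
\]
```

The same calculation gives about 87.05 for allLayer1 and 86.02 for AllFeatures, so the three rows are internally consistent.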
>> > These numbers look different than the ones reported in the paper.  In
>> > particular, the baseline and AllFeatures models' performance is very
>> > different.  Is there anything else that I could be missing that needs to be
>> > done?  It would be good to try to understand what could be going wrong.
>> > I would also be very interested in running the tagger over the web page data
>> > you report on.  Is it possible to make that dataset available?
>> > Thanks again for the help.
>> > - Jeff
>> >
>> > On Thu, Mar 10, 2011 at 2:13 PM, Lev-Arie Ratinov
>> > <ratinov2 AT uiuc.edu>
>> > wrote:
>> >>
>> >> Hi Jeff.
>> >>
>> >> I've seen this error:
>> >> "ERROR: Can't locate NETaggerLevel1.lc in the class path."
>> >> before. It's an elusive error. The file NETaggerLevel1.java
>> >> is generated automatically, so I believe that understanding
>> >> it is the wrong way to go.
>> >>
>> >> Are you using Unix/Linux or Windows? The error you're reporting
>> >> is typical for Windows systems. One of the tricks there is that
>> >> in Windows, your paths should be absolute, e.g.:
>> >>
>> >>
>> >>
>> >> D:\temp\TFGTagger\dist\TFGTagger.jar;D:\temp\TFGTagger\lib\LBJ2.jar;D:\temp\TFGTagger\lib\LBJ2Library.jar
>> >> EndToEndSystemTFG.TFGTagger annotate
>> >> D:\temp\TFGTagger\Data\SampleEmails\SampleRawMails
>> >> D:\temp\TFGTagger\Data\SampleEmails\SampleTaggedMails
>> >> D:\temp\TFGTagger\Config\TFG_ForClient_X.config
>> >
>> > I am using 64-bit Ubuntu linux 9.10 with a 64-bit Sun JDK 1.6.xxx.
>> > The issue is that the classifier classes have static initializer methods
>> > which set the lcFilePath member variable, such as the one below:
>> >   static
>> >   {
>> >     lcFilePath = NETypeTagger.class.getResource("NETypeTagger.lc");
>> >     if (lcFilePath == null)
>> >     {
>> >       System.err.println("ERROR: Can't locate NETypeTagger.lc in the class path.");
>> >       System.exit(1);
>> >     }
>> >   }
>> > If the lc file isn't there, then the classifier exits.  The lc file isn't in
>> > the distribution, so I don't see how the file in the path could ever be
>> > present...
>> > To get around this problem I removed the static initializers and set the
>> > lcFilePath in the constructor to the saved classifier locations.  This
>> > allowed them to read the saved models included in the distribution.
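
The workaround described above could be sketched roughly as follows. This is not the generated LBJ code and not the attached wrapper: ModelLocator, locate, and the fallback path are hypothetical names used for illustration. The only behaviour taken from the thread is that the model is first looked up with Class.getResource, and that a missing resource should not have to end in System.exit.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper: resolves a saved model (.lc) file from the class path
// first, then from an explicit file, and throws instead of exiting the JVM.
final class ModelLocator {

    static URL locate(Class<?> owner, String resourceName, File fallback) {
        // Same lookup as the static initializer quoted above.
        URL url = owner.getResource(resourceName);
        if (url != null) {
            return url;
        }
        // Assumption: the saved models from the distribution live at a known path.
        if (fallback != null && fallback.isFile()) {
            try {
                return fallback.toURI().toURL();
            } catch (MalformedURLException e) {
                throw new IllegalStateException("Bad model path: " + fallback, e);
            }
        }
        throw new IllegalStateException(
            "Can't locate " + resourceName + " in the class path or at " + fallback);
    }

    public static void main(String[] args) {
        // The default file name here is only a placeholder for illustration.
        File fallback = new File(args.length > 0 ? args[0] : "NETypeTagger.lc");
        try {
            System.out.println(locate(ModelLocator.class, "NETypeTagger.lc", fallback));
        } catch (IllegalStateException e) {
            System.err.println("ERROR: " + e.getMessage());
        }
    }
}
```

Throwing an exception rather than calling System.exit(1) lets the calling code decide how to report a missing model, which matters when several taggers run in the same process.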
>> >
>> >>
>> >> Also, the column format I'm using is a little different from CoNLL03
>> >> annotation format. Below is an example, note that there is shallow
>> >> parse and POS info there, but I don't use it. So you can replace these
>> >> columns by dummy values. Sorry, I don't have a script for that. The
>> >> importance of the column format is that sentence boundaries are
>> >> marked. I have support for "brackets format", but then you'll rely on
>> >> my own sentence splitting, and you won't be able to reproduce the
>> >> results. Here is the sample data. Please let me know if it solves your
>> >> problems:
>> >>
>> >>
>> >> O       0       0       O       -X-     -DOCSTART-      x       x       0
>> >>
>> >> O       0       0       I-NP    NNP     CRICKET x       x       0
>> >> O       0       1       O       :       -       x       x       0
>> >> B-ORG   0       2       I-NP    NNP     LEICESTERSHIRE  x       x       0
>> >> O       0       3       I-NP    NNP     TAKE    x       x       0
>> >> O       0       4       I-PP    IN      OVER    x       x       0
>> >> O       0       5       I-NP    NNP     AT      x       x       0
>> >> O       0       6       I-NP    NNP     TOP     x       x       0
>> >> O       0       7       I-NP    NNP     AFTER   x       x       0
>> >> O       0       8       I-NP    NNP     INNINGS x       x       0
>> >> O       0       9       I-NP    NN      VICTORY x       x       0
>> >> O       0       10      O       .       .       x       x       0
>> >>
>> >> B-LOC   0       0       I-NP    NNP     LONDON  x       x       0
>> >> O       0       1       I-NP    CD      1996-08-30      x       x       0
>> >>
>> >> B-MISC  0       0       I-NP    NNP     West    x       x       0
>> >> I-MISC  0       1       I-NP    NNP     Indian  x       x       0
>> >> O       0       2       I-NP    NN      all-rounder     x       x       0
>> >> B-PER   0       3       I-NP    NNP     Phil    x       x       0
>> >> I-PER   0       4       I-NP    NNP     Simmons x       x       0
>> >> O       0       5       I-VP    VBD     took    x       x       0
>> >> O       0       6       I-NP    CD      four    x       x       0
>> >> O       0       7       I-PP    IN      for     x       x       0
>> >> O       0       8       I-NP    CD      38      x       x       0
>> >> O       0       9       I-PP    IN      on      x       x       0
>> >> O       0       10      I-NP    NNP     Friday  x       x       0
>> >> O       0       11      I-PP    IN      as      x       x       0
>> >> B-ORG   0       12      I-NP    NNP     Leicestershire  x       x       0
>> >> O       0       13      I-VP    VBD     beat    x       x       0
>> >> B-ORG   0       14      I-NP    NNP     Somerset        x       x       0
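
Since no conversion script exists, here is a rough sketch of one, inferred only from the sample above. The class name ConllToColumnFormat is hypothetical. It assumes the input is the standard four-column CoNLL03 layout (token, POS, chunk, NE tag), that output columns are tab-separated, and that the constant columns (the second "0" and the trailing "x x 0") can be copied verbatim from the sample, since their meaning is not stated in this thread. It copies the NE tag unchanged and does not convert between tagging schemes, even though the sample uses a B- prefix on the first token of each entity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical converter from four-column CoNLL03 lines to the nine-column
// layout shown in the sample above.
final class ConllToColumnFormat {

    static List<String> convert(List<String> conllLines) {
        List<String> out = new ArrayList<>();
        int index = 0; // token position within the current sentence
        for (String line : conllLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                out.add("");   // keep the blank line that marks a sentence boundary
                index = 0;
                continue;
            }
            String[] c = trimmed.split("\\s+");
            if (c.length != 4) {
                throw new IllegalArgumentException("Expected 4 columns: " + line);
            }
            // Input order: token, POS, chunk, NE tag.
            // Output order: NE tag, 0, index, chunk, POS, token, x, x, 0.
            out.add(String.join("\t",
                c[3], "0", Integer.toString(index), c[2], c[1], c[0], "x", "x", "0"));
            index++;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "-DOCSTART- -X- O O",
            "",
            "LONDON NNP I-NP I-LOC",
            "1996-08-30 CD I-NP O");
        for (String line : convert(input)) {
            System.out.println(line);
        }
    }
}
```

Because the POS and shallow-parse columns are not used by the tagger, they could equally be replaced by dummy values; the sketch keeps them only because the CoNLL03 files already provide them.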
>> >
>> > Thanks.  This helps.  I modified the reader code to read the conll format
>> > and pick out the necessary column values.
>> >
>> >>
>> >>
>> >>
>> >> On Thu, Mar 10, 2011 at 5:44 AM, Jeff Dalton
>> >> <jdalton AT cs.umass.edu>
>> >> wrote:
>> >> > I'm a PhD student at UMass Amherst in the CIIR.  I am trying to run the UIUC
>> >> > NER tagger for a project I am working on.  I downloaded the distribution
>> >> > from the website.  However, when I try to run it, I get the error: "ERROR:
>> >> > Can't locate NETaggerLevel1.lc in the class path."  I cannot locate the
>> >> > specified file in the distribution.  It looks like the output of the saved
>> >> > classifier instance.  From the code in NETaggerLevel1.java, it is not clear
>> >> > what the appropriate setting for lcFilePath is, or how I should create it.
>> >> > I assume it is created as part of the training process. I tried to run the
>> >> > training command, but it fails in the same location. Could you perhaps shed
>> >> > some light on this mystery?
>> >> > Also, the parser appears to be loading data in Reuters format.  I have the
>> >> > conll data and the data format appears to differ.  Are there scripts to
>> >> > convert between the formats?  Perhaps I am missing a bit of documentation on
>> >> > training.  I'd like to try and reproduce the conll results.
>> >> > I would appreciate any help you could give.
>> >> > Cheers,
>> >> > - Jeff
>> >> > _______________________________________________
>> >> > illinois-ml-nlp-users mailing list
>> >> > illinois-ml-nlp-users AT cs.uiuc.edu
>> >> > http://lists.cs.uiuc.edu/mailman/listinfo/illinois-ml-nlp-users
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Peace&Love
>> >
>> >
>>
>>
>>
>> --
>> Peace&Love
>
>



--
Peace&Love




