Re: [Illinois-ml-nlp-users] Running the LBJ NER Tagger


  • From: Jeff Dalton <jdalton AT cs.umass.edu>
  • To: Lev-Arie Ratinov <ratinov2 AT uiuc.edu>
  • Cc: illinois-ml-nlp-users AT cs.uiuc.edu
  • Subject: Re: [Illinois-ml-nlp-users] Running the LBJ NER Tagger
  • Date: Mon, 21 Mar 2011 11:42:37 -0400
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/illinois-ml-nlp-users>
  • List-id: Support for users of CCG software <illinois-ml-nlp-users.cs.uiuc.edu>

I am invoking the chunk method on a per-sentence basis.  I added a print statement in chunk() to show the sequences being tagged.  Here are the calls for the first few sentences from the testb collection:

Tagging seq: SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT .
Tagging seq: Nadim Ladki
Tagging seq: AL-AIN , United Arab Emirates 1996-12-06
Tagging seq: Japan began the defence of their Asian Cup title with a lucky 2-1 win against Syria in a Group C championship match on Friday .
Tagging seq: But China saw their luck desert them in the second match of the group , crashing to a surprise 2-0 defeat to newcomers Uzbekistan .
Tagging seq: China controlled most of the match and saw several chances missed until the 78th minute when Uzbek striker Igor Shkvyrin took advantage of a misdirected defensive header to lob the ball over the advancing Chinese keeper and into an empty net .
Tagging seq: Oleg Shatskiku made sure of the win in injury time , hitting an unstoppable left foot shot from just outside the area .
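
For reference, my calling loop boils down to roughly the following (a simplified sketch; NerChunker here just stands in for my LingPipe-style wrapper around the LBJ tagger and is not a class from the LBJ distribution):

  import java.util.List;

  public class TagSentences
  {
    interface NerChunker
    {
      // stand-in for the method I actually call:
      // public Chunking chunk(CharSequence cSeq)
      void chunk(CharSequence sentence);
    }

    static void tagAll(NerChunker chunker, List<String> sentences)
    {
      for (String sentence : sentences)
      {
        // the debug print that produced the "Tagging seq: ..." lines above
        System.out.println("Tagging seq: " + sentence);
        // one call per sentence; chunk() never sees more than one sentence
        chunker.chunk(sentence);
      }
    }
  }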

I got a similarly strange result using the built-in LbjNERTaggerMain evaluation.  I ran:

LbjTagger.NerTagger -test ./data/conll03/eng.testb  -c true Config/baselineFeatures.config

And here is the relevant output from the LBJ evaluation:

Phrase-level Acc Level1:
 Label   Precision Recall   F1   LCount PCount
----------------------------------------------
LOC         76.963 68.942 72.732   1919   1719
MISC        71.203 49.724 58.556    905    632
ORG         80.738 36.010 49.806   2491   1111
PER         75.290 42.084 53.990   2773   1550
----------------------------------------------
O            0.000  0.000  0.000    385   3461
----------------------------------------------
Overall     76.556 47.441 58.580   8088   5012

And from the allLayer1 config:
Phrase-level Acc Level2:
 Label   Precision Recall   F1   LCount PCount
----------------------------------------------
LOC         92.617 80.406 86.081   1919   1666
MISC        80.966 61.105 69.647    905    683
ORG         88.450 58.410 70.358   2491   1645
PER         95.524 56.185 70.754   2773   1631
----------------------------------------------
O            0.000  0.000  0.000    284   2747
----------------------------------------------
Overall     90.827 63.168 74.513   8088   5625

I'm trying to determine the source of the differences.  Is it a difference in the conll test data format? The model files packaged with the distribution?  Or a result of my wrapping / evaluation code?

- Jeff

On Mon, Mar 21, 2011 at 1:58 AM, Lev-Arie Ratinov <ratinov2 AT uiuc.edu> wrote:
Hm.

Your usage of my code is really ingenious. I still recommend that you
read the column format with the -c option, but what you're doing looks
legit.

Are you invoking
public Chunking chunk(CharSequence cSeq)
on a sentence by sentence basis?

Also, there is something that worries me a great deal: the 71.19 F1 of
the baseline. People who implement perceptron-based NER taggers all
report baseline F1 scores above 80. Either my model is wrong, or
something else is wrong.

I'll need more time to check this. But in the meantime, can you
answer my question:
Are you invoking
public Chunking chunk(CharSequence cSeq)
on a sentence-by-sentence basis?


On Fri, Mar 18, 2011 at 4:52 PM, Jeff Dalton <jdalton AT cs.umass.edu> wrote:
> Thanks for the quick reply!  Attached is a zip file containing the result
> output for the eng.testb dataset with three of the models included in the
> distribution (baselineFeatures, allLayer1, and allFeatures).  Each file in
> the zip has lines that contain: token, corpus tag, and predicted value.  The
> files are in a format that can be processed by conlleval.pl.
> I am attaching the wrapper code that I wrote around LBJ.  It has a main and
> should be runnable.  The only external dependency is that it uses the
> LingPipe 4.0 jar for some Chunk interfaces, which I found useful as a common
> interface for running multiple taggers across a sequence.
> Here is also the output from running the tagger initialization to show the
> differences in the features being used.  This should show that the correct
> configurations are being loaded.
> Reading input: ./data/conll03/eng.testb to output test file:./results/test3
> with tagger:lbj-baseline
> Adding feature: Forms
> Adding feature: Capitalization
> Adding feature: WordTypeInformation
> Adding feature: Affixes
> Adding feature: PreviousTag1
> Adding feature: PreviousTag2
> 95262 words added
> The console output from using the AllLayer1 configuration:
> Adding feature: GazetteersFeatures
> Adding feature: Forms
> Adding feature: Capitalization
> Adding feature: WordTypeInformation
> Adding feature: Affixes
> Adding feature: PreviousTag1
> Adding feature: PreviousTag2
> Adding feature: BrownClusterPaths
> Adding feature: NEShapeTaggerFeatures
> Adding feature: aggregateContext
> Adding feature: aggregateGazetteerMatches
> Adding feature: prevTagsForContext
> 95262 words added
> loading dazzetteers....
> loading gazzetteer:....Data/KnownLists/WikiArtWork.lst
> ...
> loading gazzetteer:....Data/KnownLists/known_place.lst
> found 30 gazetteers
> loading contextless shape classifier
> loading shape classifier
> Done loading shape classifier
> Done- loading contextless shape classifier
> And lastly, the All features console:
> Adding feature: GazetteersFeatures
> Adding feature: Forms
> Adding feature: Capitalization
> Adding feature: WordTypeInformation
> Adding feature: Affixes
> Adding feature: PreviousTag1
> Adding feature: PreviousTag2
> Adding feature: BrownClusterPaths
> Adding feature: NEShapeTaggerFeatures
> Adding feature: aggregateContext
> Adding feature: aggregateGazetteerMatches
> Adding feature: prevTagsForContext
> Adding feature: PatternFeatures
> Adding feature: PredictionsLevel1
> 95262 words added
> loading dazzetteers....
> ...
> - Jeff
> On Fri, Mar 18, 2011 at 5:12 PM, Lev-Arie Ratinov <ratinov2 AT uiuc.edu> wrote:
>>
>> Hi Jeff. The first step would be to double-check that the output of
>> your system as you're running it is identical to the intended output.
>> Can you send me the output of my system?
>>
>> The web data is attached. I may have re-annotated the data, and I've
>> also discovered a bug in my evaluation on the Web portion. So you'll
>> see a different result on the Web data. However, the result on the
>> CoNLL data should be around 90F1 phrase-level.
>>
>>
>>
>> On Fri, Mar 18, 2011 at 3:30 PM, Jeff Dalton <jdalton AT cs.umass.edu> wrote:
>> > Thanks for getting back to me.  ... My responses are inline below.  With a
>> > bit of hacking I managed to get the NER classifier working.  Here are the
>> > results of the output for the eng.testb set from CoNLL 2003.  The evaluation
>> > is done using conlleval.pl and reports phrase-level measures:
>> > processed 46435 tokens with 5645 phrases...
>> > Config/baselineFeatures.config
>> > accuracy:  93.91%; precision:  75.78%; recall:  67.12%; FB1:  71.19
>> > Config/allLayer1.config
>> > accuracy:  97.28%; precision:  88.34%; recall:  85.79%; FB1:  87.05
>> > Config/AllFeatures.config
>> > accuracy:  97.09%; precision:  88.39%; recall:  83.77%; FB1:  86.02
>> > These numbers look different from the ones reported in the paper.  In
>> > particular, the baseline and AllFeatures models' performance is very
>> > different.  Is there anything else I could be missing that needs to be
>> > done?  It would be good to try to understand what could be going wrong.
>> > I would also be very interested in running the tagger over the web page data
>> > you report on.  Is it possible to make that dataset available?
>> > Thanks again for the help.
>> > - Jeff
>> >
>> > On Thu, Mar 10, 2011 at 2:13 PM, Lev-Arie Ratinov <ratinov2 AT uiuc.edu>
>> > wrote:
>> >>
>> >> Hi Jeff.
>> >>
>> >> I've seen this error:
>> >> "ERROR: Can't locate NETaggerLevel1.lc in the class path."
>> >> before. It's an elusive error. The file NETaggerLevel1.java
>> >> is generated automatically, so I believe that understanding
>> >> it is the wrong way to go.
>> >>
>> >> Are you using Unix/Linux or Windows? The error you're reporting
>> >> is typical of Windows systems. One of the tricks there is that
>> >> on Windows, your paths should be absolute, e.g.:
>> >>
>> >>
>> >>
>> >> D:\temp\TFGTagger\dist\TFGTagger.jar;D:\temp\TFGTagger\lib\LBJ2.jar;D:\temp\TFGTagger\lib\LBJ2Library.jar
>> >> EndToEndSystemTFG.TFGTagger annotate
>> >> D:\temp\TFGTagger\Data\SampleEmails\SampleRawMails
>> >> D:\temp\TFGTagger\Data\SampleEmails\SampleTaggedMails
>> >> D:\temp\TFGTagger\Config\TFG_ForClient_X.config
>> >
>> > I am using 64-bit Ubuntu linux 9.10 with a 64-bit Sun JDK 1.6.xxx.
>> > The issue is that the classifier classes have static initializer methods
>> > which set the lcFilePath member variable, such as the one below:
>> >   static
>> >   {
>> >     lcFilePath = NETypeTagger.class.getResource("NETypeTagger.lc");
>> >     if (lcFilePath == null)
>> >     {
>> >       System.err.println("ERROR: Can't locate NETypeTagger.lc in the class path.");
>> >       System.exit(1);
>> >     }
>> >   }
>> > If the lc file isn't there, then the classifier exits.  The lc file isn't in
>> > the distribution, so I don't see how the file in the path could ever be
>> > present...
>> > To get around this problem I removed the static initializers and set the
>> > lcFilePath in the constructor to the saved classifier locations.  This
>> > allowed them to read the saved models included in the distribution.
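>> >
>> > Roughly, the change looks like this (illustrative only; the constructor
>> > argument is my own addition and is not part of the generated class):
>> >
>> >   public NETypeTagger(String savedModelPath)
>> >   {
>> >     try
>> >     {
>> >       // point lcFilePath at the saved model on disk instead of relying on
>> >       // the classpath lookup done by the removed static initializer
>> >       lcFilePath = new java.io.File(savedModelPath).toURI().toURL();
>> >     }
>> >     catch (java.net.MalformedURLException e)
>> >     {
>> >       throw new RuntimeException(e);
>> >     }
>> >   }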
>> >
>> >>
>> >> Also, the column format I'm using is a little different from the CoNLL03
>> >> annotation format. Below is an example; note that there is shallow
>> >> parse and POS info there, but I don't use it, so you can replace those
>> >> columns with dummy values. Sorry, I don't have a script for that. The
>> >> important thing about the column format is that sentence boundaries are
>> >> marked. I have support for a "brackets format", but then you'll rely on
>> >> my own sentence splitting, and you won't be able to reproduce the
>> >> results. Here is the sample data. Please let me know if it solves your
>> >> problems:
>> >>
>> >>
>> >> O       0       0       O       -X-     -DOCSTART-      x       x       0
>> >>
>> >> O       0       0       I-NP    NNP     CRICKET x       x       0
>> >> O       0       1       O       :       -       x       x       0
>> >> B-ORG   0       2       I-NP    NNP     LEICESTERSHIRE  x       x       0
>> >> O       0       3       I-NP    NNP     TAKE    x       x       0
>> >> O       0       4       I-PP    IN      OVER    x       x       0
>> >> O       0       5       I-NP    NNP     AT      x       x       0
>> >> O       0       6       I-NP    NNP     TOP     x       x       0
>> >> O       0       7       I-NP    NNP     AFTER   x       x       0
>> >> O       0       8       I-NP    NNP     INNINGS x       x       0
>> >> O       0       9       I-NP    NN      VICTORY x       x       0
>> >> O       0       10      O       .       .       x       x       0
>> >>
>> >> B-LOC   0       0       I-NP    NNP     LONDON  x       x       0
>> >> O       0       1       I-NP    CD      1996-08-30      x       x       0
>> >>
>> >> B-MISC  0       0       I-NP    NNP     West    x       x       0
>> >> I-MISC  0       1       I-NP    NNP     Indian  x       x       0
>> >> O       0       2       I-NP    NN      all-rounder     x       x       0
>> >> B-PER   0       3       I-NP    NNP     Phil    x       x       0
>> >> I-PER   0       4       I-NP    NNP     Simmons x       x       0
>> >> O       0       5       I-VP    VBD     took    x       x       0
>> >> O       0       6       I-NP    CD      four    x       x       0
>> >> O       0       7       I-PP    IN      for     x       x       0
>> >> O       0       8       I-NP    CD      38      x       x       0
>> >> O       0       9       I-PP    IN      on      x       x       0
>> >> O       0       10      I-NP    NNP     Friday  x       x       0
>> >> O       0       11      I-PP    IN      as      x       x       0
>> >> B-ORG   0       12      I-NP    NNP     Leicestershire  x       x       0
>> >> O       0       13      I-VP    VBD     beat    x       x       0
>> >> B-ORG   0       14      I-NP    NNP     Somerset        x       x       0
>> >
>> > Thanks.  This helps.  I modified the reader code to read the conll format
>> > and pick out the necessary column values.
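>> >
>> > The column mapping I apply is roughly the following (sketched here as a
>> > standalone converter rather than the actual reader change, and the class
>> > name is just for illustration): read CoNLL03 lines of the form
>> > "word POS chunk NE-tag" and emit the 9-column layout shown above, with
>> > dummy values in the unused columns.
>> >
>> >   import java.io.*;
>> >
>> >   public class Conll03ToLbjColumns
>> >   {
>> >     public static void main(String[] args) throws IOException
>> >     {
>> >       BufferedReader in = new BufferedReader(new FileReader(args[0]));
>> >       PrintWriter out = new PrintWriter(new FileWriter(args[1]));
>> >       int tokenIndex = 0;
>> >       String line;
>> >       while ((line = in.readLine()) != null)
>> >       {
>> >         line = line.trim();
>> >         if (line.isEmpty())
>> >         {
>> >           out.println();        // blank line marks a sentence boundary
>> >           tokenIndex = 0;
>> >           continue;
>> >         }
>> >         String[] cols = line.split("\\s+");  // word POS chunk NE-tag
>> >         String word = cols[0], pos = cols[1], chunk = cols[2], ner = cols[3];
>> >         // NE-tag  0  index  chunk  POS  word  x  x  0
>> >         out.println(ner + "\t0\t" + tokenIndex + "\t" + chunk + "\t" + pos
>> >                     + "\t" + word + "\tx\tx\t0");
>> >         tokenIndex++;
>> >       }
>> >       in.close();
>> >       out.close();
>> >     }
>> >   }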
>> >
>> >>
>> >>
>> >>
>> >> On Thu, Mar 10, 2011 at 5:44 AM, Jeff Dalton <jdalton AT cs.umass.edu>
>> >> wrote:
>> >> > I'm a PhD student at UMass Amherst in the CIIR.  I am trying to run the
>> >> > UIUC NER tagger for a project I am working on.  I downloaded the
>> >> > distribution from the website.  However, when I try to run it, I get the
>> >> > error: "ERROR: Can't locate NETaggerLevel1.lc in the class path."  I
>> >> > cannot locate the specified file in the distribution.  It looks like the
>> >> > output of the saved classifier instance.  From the code in
>> >> > NETaggerLevel1.java, it is not clear what the appropriate setting for
>> >> > lcFilePath is, or how I should create it.  I assume it is created as part
>> >> > of the training process.  I tried to run the training command, but it
>> >> > fails in the same location.  Could you perhaps shed some light on this
>> >> > mystery?
>> >> > Also, the parser appears to be loading data in Reuters format.  I have
>> >> > the conll data and the data format appears to differ.  Are there scripts
>> >> > to convert between the formats?  Perhaps I am missing a bit of
>> >> > documentation on training.  I'd like to try and reproduce the conll
>> >> > results.
>> >> > I would appreciate any help you could give.
>> >> > Cheers,
>> >> > - Jeff
>> >> > _______________________________________________
>> >> > illinois-ml-nlp-users mailing list
>> >> > illinois-ml-nlp-users AT cs.uiuc.edu
>> >> > http://lists.cs.uiuc.edu/mailman/listinfo/illinois-ml-nlp-users
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Peace&Love
>> >
>> >
>>
>>
>>
>> --
>> Peace&Love
>
>



--
Peace&Love



