
illinois-ml-nlp-users - Re: [Illinois-ml-nlp-users] Running the LBJ NER Tagger

illinois-ml-nlp-users AT lists.cs.illinois.edu

Subject: Support for users of CCG software closed 7-27-20


Re: [Illinois-ml-nlp-users] Running the LBJ NER Tagger


  • From: Jeff Dalton <jdalton AT cs.umass.edu>
  • To: Lev-Arie Ratinov <ratinov2 AT uiuc.edu>
  • Cc: illinois-ml-nlp-users AT cs.uiuc.edu
  • Subject: Re: [Illinois-ml-nlp-users] Running the LBJ NER Tagger
  • Date: Fri, 18 Mar 2011 17:52:01 -0400
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/illinois-ml-nlp-users>
  • List-id: Support for users of CCG software <illinois-ml-nlp-users.cs.uiuc.edu>

Thanks for the quick reply!  Attached is a zip file containing the output for the eng.testb dataset with three of the models included in the distribution (baselineFeatures, allLayer1, and allFeatures).  Each line in each file contains the token, the gold (corpus) tag, and the predicted tag, in a format that conlleval.pl can process.
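For reference, a hypothetical pair of lines in that conlleval.pl-compatible shape (token, gold tag, predicted tag, whitespace-separated, with a blank line between sentences; the tokens here are made up, not from the attached files):

```
LONDON B-LOC B-LOC
1996-08-30 O O
```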

I am attaching the wrapper code that I wrote around LBJ.  It has a main and should be runnable.  The only external dependency is that it uses the LingPipe 4.0 jar for some Chunk interfaces, which I found useful as a common interface for running multiple taggers across a sequence. 

Here is also the output from running the tagger initialization, showing the differences in the features being used.  This should confirm that the correct configurations are being loaded.

Reading input: ./data/conll03/eng.testb to output test file:./results/test3 with tagger:lbj-baseline
Adding feature: Forms
Adding feature: Capitalization
Adding feature: WordTypeInformation
Adding feature: Affixes
Adding feature: PreviousTag1
Adding feature: PreviousTag2
95262 words added

The console output from using the AllLayer1 configuration:
Adding feature: GazetteersFeatures
Adding feature: Forms
Adding feature: Capitalization
Adding feature: WordTypeInformation
Adding feature: Affixes
Adding feature: PreviousTag1
Adding feature: PreviousTag2
Adding feature: BrownClusterPaths
Adding feature: NEShapeTaggerFeatures
Adding feature: aggregateContext
Adding feature: aggregateGazetteerMatches
Adding feature: prevTagsForContext
95262 words added
loading dazzetteers....
loading gazzetteer:....Data/KnownLists/WikiArtWork.lst
...
loading gazzetteer:....Data/KnownLists/known_place.lst
found 30 gazetteers
loading contextless shape classifier
loading shape classifier
Done loading shape classifier
Done- loading contextless shape classifier

And lastly, the console output for the AllFeatures configuration:
Adding feature: GazetteersFeatures
Adding feature: Forms
Adding feature: Capitalization
Adding feature: WordTypeInformation
Adding feature: Affixes
Adding feature: PreviousTag1
Adding feature: PreviousTag2
Adding feature: BrownClusterPaths
Adding feature: NEShapeTaggerFeatures
Adding feature: aggregateContext
Adding feature: aggregateGazetteerMatches
Adding feature: prevTagsForContext
Adding feature: PatternFeatures
Adding feature: PredictionsLevel1
95262 words added
loading dazzetteers....
... 

- Jeff

On Fri, Mar 18, 2011 at 5:12 PM, Lev-Arie Ratinov <ratinov2 AT uiuc.edu> wrote:
Hi Jeff. The first step would be to double-check that the
output of the system as you're running it is identical to the
intended output. Can you send me the output you're getting from my system?

The web data is attached. I may have re-annotated the data, and I've
also discovered a bug in my evaluation on the Web portion. So you'll
see a different result on the Web data. However, the result on the
CoNLL data should be around 90F1 phrase-level.



On Fri, Mar 18, 2011 at 3:30 PM, Jeff Dalton <jdalton AT cs.umass.edu> wrote:
> Thanks for getting back to me.  ... My responses are inline below.  With a
> bit of hacking I managed to get the ner classifier working.  Here are the
> results of the output for the eng.testb set from conll 2003.  The evaluation
> is done using conlleval.pl and reports phrase-level measures:
> processed 46435 tokens with 5645 phrases...
> Config/baselineFeatures.config
> accuracy:  93.91%; precision:  75.78%; recall:  67.12%; FB1:  71.19
> Config/allLayer1.config
> accuracy:  97.28%; precision:  88.34%; recall:  85.79%; FB1:  87.05
> Config/AllFeatures.config
> accuracy:  97.09%; precision:  88.39%; recall:  83.77%; FB1:  86.02
> These numbers look different from the ones reported in the paper.  In
> particular, the baseline and AllFeatures models' performance is very
> different.  Is there anything else I could be missing that needs to be
> done?  It would be good to understand what could be going wrong.
> I would also be very interested in running the tagger over the web page data
> you report on.  Is it possible to make that dataset available?
> Thanks again for the help.
> - Jeff
>
> On Thu, Mar 10, 2011 at 2:13 PM, Lev-Arie Ratinov <ratinov2 AT uiuc.edu> wrote:
>>
>> Hi Jeff.
>>
>> I've seen this error:
>> "ERROR: Can't locate NETaggerLevel1.lc in the class path."
>> before. It's an elusive error. The file NETaggerLevel1.java
>> is generated automatically, so I believe that understanding
>> it is the wrong way to go.
>>
>> Are you using Unix/Linux or Windows? The error you're reporting
>> is typical of Windows systems. One of the tricks there is that
>> on Windows, your paths should be absolute, e.g.:
>>
>>
>> D:\temp\TFGTagger\dist\TFGTagger.jar;D:\temp\TFGTagger\lib\LBJ2.jar;D:\temp\TFGTagger\lib\LBJ2Library.jar
>> EndToEndSystemTFG.TFGTagger annotate
>> D:\temp\TFGTagger\Data\SampleEmails\SampleRawMails
>> D:\temp\TFGTagger\Data\SampleEmails\SampleTaggedMails
>> D:\temp\TFGTagger\Config\TFG_ForClient_X.config
>
> I am using 64-bit Ubuntu Linux 9.10 with a 64-bit Sun JDK 1.6.xxx.
> The issue is that the classifier classes have static initializer methods
> which set the lcFilePath member variable, such as the one below:
>   static
>   {
>     lcFilePath = NETypeTagger.class.getResource("NETypeTagger.lc");
>     if (lcFilePath == null)
>     {
>       System.err.println("ERROR: Can't locate NETypeTagger.lc in the class path.");
>       System.exit(1);
>     }
>   }
> If the lc file isn't there, then the classifier exits.  The lc file isn't in
> the distribution, so I don't see how the file in the path could ever be
> present...
> To get around this problem I removed the static initializers and set the
> lcFilePath in the constructor to the saved classifier locations.  This
> allowed them to read the saved models included in the distribution.
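A minimal sketch of the workaround described above (this is illustrative code, not the actual LBJ source; the class and constructor argument names are my own): resolve the model location at construction time, falling back to an explicit filesystem path when `getResource()` returns null, instead of calling `System.exit(1)` in a static initializer.

```java
import java.io.File;
import java.net.URL;

public class ModelLocator {

    private final String modelPath;

    // fallbackPath is a hypothetical stand-in for the pathToModelFile
    // value read from the config file.
    public ModelLocator(String resourceName, String fallbackPath) {
        URL onClasspath = ModelLocator.class.getResource(resourceName);
        if (onClasspath != null) {
            // Found on the classpath, as the generated static block expects.
            modelPath = onClasspath.toString();
        } else if (new File(fallbackPath).exists()) {
            // Fall back to the saved model shipped with the distribution.
            modelPath = fallbackPath;
        } else {
            throw new IllegalStateException("Can't locate " + resourceName
                    + " on the classpath or at " + fallbackPath);
        }
    }

    public String getModelPath() {
        return modelPath;
    }

    public static void main(String[] args) throws Exception {
        // Use a temp file as a stand-in for a saved .lc model.
        File tmp = File.createTempFile("model", ".lc");
        ModelLocator locator = new ModelLocator("NETypeTagger.lc", tmp.getAbsolutePath());
        System.out.println(locator.getModelPath());
        tmp.delete();
    }
}
```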
>
>>
>> Also, the column format I'm using is a little different from the CoNLL03
>> annotation format. Below is an example; note that there is shallow
>> parse and POS info there, but I don't use it, so you can replace those
>> columns with dummy values. Sorry, I don't have a script for that. The
>> important thing about the column format is that sentence boundaries are
>> marked. I have support for a "brackets format", but then you'll rely on
>> my own sentence splitting, and you won't be able to reproduce the
>> results. Here is the sample data. Please let me know if it solves your
>> problems:
>>
>>
>> O       0       0       O       -X-     -DOCSTART-      x       x       0
>>
>> O       0       0       I-NP    NNP     CRICKET x       x       0
>> O       0       1       O       :       -       x       x       0
>> B-ORG   0       2       I-NP    NNP     LEICESTERSHIRE  x       x       0
>> O       0       3       I-NP    NNP     TAKE    x       x       0
>> O       0       4       I-PP    IN      OVER    x       x       0
>> O       0       5       I-NP    NNP     AT      x       x       0
>> O       0       6       I-NP    NNP     TOP     x       x       0
>> O       0       7       I-NP    NNP     AFTER   x       x       0
>> O       0       8       I-NP    NNP     INNINGS x       x       0
>> O       0       9       I-NP    NN      VICTORY x       x       0
>> O       0       10      O       .       .       x       x       0
>>
>> B-LOC   0       0       I-NP    NNP     LONDON  x       x       0
>> O       0       1       I-NP    CD      1996-08-30      x       x       0
>>
>> B-MISC  0       0       I-NP    NNP     West    x       x       0
>> I-MISC  0       1       I-NP    NNP     Indian  x       x       0
>> O       0       2       I-NP    NN      all-rounder     x       x       0
>> B-PER   0       3       I-NP    NNP     Phil    x       x       0
>> I-PER   0       4       I-NP    NNP     Simmons x       x       0
>> O       0       5       I-VP    VBD     took    x       x       0
>> O       0       6       I-NP    CD      four    x       x       0
>> O       0       7       I-PP    IN      for     x       x       0
>> O       0       8       I-NP    CD      38      x       x       0
>> O       0       9       I-PP    IN      on      x       x       0
>> O       0       10      I-NP    NNP     Friday  x       x       0
>> O       0       11      I-PP    IN      as      x       x       0
>> B-ORG   0       12      I-NP    NNP     Leicestershire  x       x       0
>> O       0       13      I-VP    VBD     beat    x       x       0
>> B-ORG   0       14      I-NP    NNP     Somerset        x       x       0
>
> Thanks.  This helps.  I modified the reader code to read the CoNLL format
> and pick out the necessary column values.
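A minimal sketch of such a reader, under my reading of the sample above (this is illustrative code, not the modified reader itself): column 1 holds the NE label, column 6 holds the token, the POS and shallow-parse columns can be ignored, and a blank line marks a sentence boundary.

```java
import java.util.ArrayList;
import java.util.List;

public class ColumnReader {

    /**
     * Parses whitespace-separated column lines into sentences of
     * {token, neLabel} pairs. A blank line ends the current sentence.
     */
    public static List<List<String[]>> readSentences(List<String> lines) {
        List<List<String[]>> sentences = new ArrayList<>();
        List<String[]> current = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                // Sentence boundary.
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
                continue;
            }
            String[] cols = line.trim().split("\\s+");
            // cols[5] is the token, cols[0] the NE tag; the rest is unused.
            current.add(new String[] { cols[5], cols[0] });
        }
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> sample = java.util.Arrays.asList(
                "B-LOC   0   0   I-NP   NNP   LONDON       x   x   0",
                "O       0   1   I-NP   CD    1996-08-30   x   x   0");
        for (List<String[]> sentence : readSentences(sample)) {
            for (String[] tok : sentence) {
                System.out.println(tok[0] + "\t" + tok[1]);
            }
        }
    }
}
```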
>
>>
>>
>>
>> On Thu, Mar 10, 2011 at 5:44 AM, Jeff Dalton <jdalton AT cs.umass.edu> wrote:
>> > I'm a PhD student at UMass Amherst in the CIIR.  I am trying to run the
>> > UIUC
>> > NER tagger for a project I am working on.  I downloaded the
>> > distribution
>> > from the website.  However, when I try to run it, I get the error:
>> >  "ERROR:
>> > Can't locate NETaggerLevel1.lc in the class path."  I cannot locate the
>> > specified file in the distribution.  It looks like the output of the
>> > saved
>> > classifier instance.   From the code in NETaggerLevel1.java, it is not
>> > clear
>> > what the appropriate setting for lcFilePath is, or how I should create
>> > it.
>> >  I assume it is created as part of the training process. I tried to run
>> > the
>> > training command, but it fails in the same location. Could you perhaps
>> > shed
>> > some light on this mystery?
>> > Also, the parser appears to be loading data in Reuters format.  I have
>> > the
>> > conll data and the data format appears to differ.  Are there scripts to
>> > convert between the formats?  Perhaps I am missing a bit of
>> > documentation on
>> > training.  I'd like to try and reproduce the conll results.
>> > I would appreciate any help you could give.
>> > Cheers,
>> > - Jeff
>> > _______________________________________________
>> > illinois-ml-nlp-users mailing list
>> > illinois-ml-nlp-users AT cs.uiuc.edu
>> > http://lists.cs.uiuc.edu/mailman/listinfo/illinois-ml-nlp-users
>> >
>> >
>>
>>
>>
>> --
>> Peace&Love
>
>



--
Peace&Love

Attachment: lbj-results.zip
Description: Zip archive

package edu.umass.ciir.mb.ner;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.Vector;

import lbj.NETaggerLevel1;
import lbj.NETaggerLevel2;
import LBJ2.classify.Classifier;
import LBJ2.parse.LinkedVector;
import LbjTagger.BracketFileManager;
import LbjTagger.NETester;
import LbjTagger.NEWord;
import LbjTagger.Parameters;

import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunker;
import com.aliasi.chunk.Chunking;
import com.aliasi.chunk.IoTagChunkCodec;
import com.aliasi.chunk.TagChunkCodec;
import com.aliasi.tag.StringTagging;

/**
 * A wrapper around the LBJ NER classifier that produces Chunk
 * objects compatible with the LingPipe framework.
 * 
 * @author jdalton
 *
 */
public class LbjTaggerWrapper implements Chunker {

	private NETaggerLevel1 m_tagger1;
	private NETaggerLevel2 m_tagger2;

	private final TagChunkCodec mCodec  = new IoTagChunkCodec();

	int numSentencesHandled = 0;
	public LbjTaggerWrapper(String configFilePath) {

		Parameters.readConfigAndLoadExternalData(configFilePath);

		System.out.println("loading the tagger");
		m_tagger1 = (NETaggerLevel1) Classifier.binaryRead(Parameters.pathToModelFile + ".level1");
		m_tagger2 = (NETaggerLevel2) Classifier.binaryRead(Parameters.pathToModelFile + ".level2");
		System.out.println("Done- loading the tagger");
	}


	@Override
	public Chunking chunk(CharSequence cSeq) {
		numSentencesHandled++;
		if (numSentencesHandled % 10==0) {
			System.out.println("processed " + numSentencesHandled);
		}
		Vector<LinkedVector> data=BracketFileManager.parseText(cSeq.toString());
		NETester.annotateBothLevels(data,m_tagger1,m_tagger2);

		//System.out.println(cSeq);
		if (data.size() == 0) {
			System.out.println("No data for sequence: " + cSeq);
		}

		ArrayList<String> terms = new ArrayList<String>();
		ArrayList<String> tags = new ArrayList<String>();
		List<Integer> tokenStarts = new ArrayList<Integer>();
		List<Integer> tokenEnds = new ArrayList<Integer>();

		int curPos = 0;

		for (int i=0; i < data.size(); i++) {
			LinkedVector vector = data.elementAt(i);
			String[] predictions=new String[vector.size()];
			String[] words=new String[vector.size()];
			for(int j=0;j<vector.size();j++){
				String prediction = bilou2bio(((NEWord)vector.get(j)).neTypeLevel2);
				predictions[j] = prediction.replace("B-", "I-");
				words[j]=((NEWord)vector.get(j)).form;
			}

			terms.addAll(Arrays.asList(words));
			tags.addAll(Arrays.asList(predictions));

			String s = cSeq.toString();
			for (String tok : words) {
				
				int start;
				// The LBJ tokenizer sometimes converts '' into " internally;
				// look for either form in the original sentence.
				if (tok.equals("\"")) {
					int s1 = s.indexOf("''",curPos);
					int s2 = s.indexOf(tok,curPos);
					if ( s1 > -1) {
						if (s2 > -1) {
							start = Math.min(s1,s2);
						} else {
							start = s1;
						}
					} else {
						start = s.indexOf(tok,curPos);
					}
				} else {
					start = s.indexOf(tok,curPos);
				}
				
				if (start < 0) {
					throw new IllegalStateException("Unable to find token: " + tok + " in string: " + cSeq.toString());
				}
				int end = start + tok.length();
				tokenStarts.add(start);
				tokenEnds.add(end);
				curPos = end;
			}
		}
		StringTagging strTagging = new StringTagging(terms, tags, cSeq, tokenStarts, tokenEnds);
		Chunking chunking = mCodec.toChunking(strTagging);
		//dumpChunks(chunking);
		return chunking;

	}

	private void dumpChunks(Chunking chunking) {
		Set<Chunk> chunkSet = chunking.chunkSet();
		for (Chunk c : chunkSet) {
			System.out.println("Chunk: " + chunking.charSequence().subSequence(c.start(),c.end()) + "\t" +c.type());
		}
		
	}


	/** Converts a BILOU prediction to BIO by mapping U- and L- prefixes to I-. */
	public static String bilou2bio(String prediction){
		if(Parameters.taggingScheme.equalsIgnoreCase(Parameters.BILOU)){
			if(prediction.startsWith("U-"))
				prediction="I-"+prediction.substring(2);
			if(prediction.startsWith("L-"))
				prediction="I-"+prediction.substring(2);
		}
		return prediction;
	}

	@Override
	public Chunking chunk(char[] cs, int start, int end) {
		// Respect the requested slice rather than chunking the whole array.
		return chunk(new String(cs, start, end - start));
	}

	public static void main(String[] args) {
		LbjTaggerWrapper wrapper = new LbjTaggerWrapper(args[0]);
		Chunking chunking = wrapper.chunk("I am a test of Michael Jordan's basketball ability.");
		Set<Chunk> chunkSet = chunking.chunkSet();
		for (Chunk c : chunkSet) {
			System.out.println(c.toString());
		}
	}

}


