
illinois-ml-nlp-users - [Illinois-ml-nlp-users] LBJ 2.8.0 released!

illinois-ml-nlp-users AT lists.cs.illinois.edu

Support for users of CCG software (list closed 7-27-20)



  • From: Nicholas Rizzolo <rizzolo AT gmail.com>
  • To: illinois-ml-nlp-users <illinois-ml-nlp-users AT cs.uiuc.edu>
  • Subject: [Illinois-ml-nlp-users] LBJ 2.8.0 released!
  • Date: Thu, 3 Mar 2011 13:48:18 -0600
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/illinois-ml-nlp-users>
  • List-id: Support for users of CCG software <illinois-ml-nlp-users.cs.uiuc.edu>

Hi everyone,

Learning Based Java 2.8.0 is on our website and ready for download:
http://cogcomp.cs.illinois.edu/page/software_view/11

This release features new caching behavior in all classifiers and a completely overhauled, more efficient feature infrastructure.  Be warned, however, that it is not backward compatible: any existing learned classifiers you may have will need to be retrained to use it.  The Illinois POS Tagger, Chunker, and Coreference Resolution Engine have already been retrained and are available for download on the website as well.


Caching
------------
Previously, every classifier specified in an LBJ source file checked its input object against a single-element cache so it could quickly return the same output when given the same input object twice in a row.  This behavior has now been removed, but it can be re-enabled on a classifier-by-classifier basis with the "cached" keyword placed just before the left arrow.  For example:

discrete ExpensiveFeature(Word w) cached <- {
  return StaticMethods.myFeature(w);
}

It's a good idea to use the "cached" keyword on even moderately expensive classifiers if they are called multiple times by the same learner, as often happens when conjunctions are used, e.g.:

discrete WordClassifier(Word w) <-
learn Label
  using
    ExpensiveFeature,
    ExpensiveFeature && OtherFeature1,
    ExpensiveFeature && OtherFeature2
    // Good thing ExpensiveFeature is cached!
    ...
end

The "cachedin" keyword which caches a classifier's prediction in a user-specified field of the input object is still available and can be used simultaneously with "cached".  They may appear in either order just before the left arrow.

Also new in this release is the "cachedinmap" keyword, which provides functionality similar to "cachedin": its cache retains the prediction for each input object as long as that object is still in memory.  But while "cachedin" puts the prediction in a field of the object, "cachedinmap" puts it in a WeakHashMap where the associated key is the input object.  The "cached" and "cachedinmap" keywords may be used simultaneously, but "cachedin" and "cachedinmap" cannot.
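To picture what "cachedinmap" does under the hood, here is a plain Java sketch of per-object caching in a WeakHashMap.  The class and method names are my own invention, not LBJ's actual generated code:

```java
import java.util.Map;
import java.util.WeakHashMap;

// Sketch of per-object prediction caching: an entry survives only as
// long as its key (the input object) is still strongly reachable, so
// the cache never prevents garbage collection of old inputs.
public class PredictionCache {
    private final Map<Object, String> cache = new WeakHashMap<>();
    int computeCount = 0;  // exposed so the effect of caching is visible

    String classify(Object input) {
        String cached = cache.get(input);
        if (cached != null) {
            return cached;  // cache hit: skip the expensive call
        }
        computeCount++;
        String prediction = "label-" + (input.hashCode() % 2);  // stand-in for real work
        cache.put(input, prediction);
        return prediction;
    }

    public static void main(String[] args) {
        PredictionCache c = new PredictionCache();
        Object word = new Object();
        c.classify(word);
        c.classify(word);  // second call is served from the map
        System.out.println("computed " + c.computeCount + " time(s)");
    }
}
```

Unlike "cachedin", no field has to be added to the input class; the map entry simply disappears once the input object itself is collected.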


New Feature Infrastructure
------------
All learners should benefit from the new feature infrastructure in both memory consumption and execution time, especially the latter whenever their feature indexes must be loaded from disk (e.g., during testing or in a final end-to-end system).  Indexes are now more compact, especially when they contain many conjunctive features, so they take less memory and load from disk more quickly.

Additionally, the user now has a choice of how string data is encoded.  By default, the string data produced by discrete classifiers is stored in String objects, which contain char arrays, and Java's char type is 2 bytes wide.  Thus, if your string data is all ASCII, you can cut memory consumption even further by selecting the "UTF-8" encoding, like this:

discrete WordClassifier(Word w) <-
learn Label
  using MyFeatures
  ...
  encoding "UTF-8"
  ...
end

Now your string data will be stored in byte arrays within WordClassifier's feature index.  Note that there will be a small penalty in execution time to do the conversions.
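The arithmetic behind the saving is easy to check in plain Java; the feature string below is just an example:

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        String feature = "expensive";  // 9 ASCII characters
        // A char[] stores 2 bytes per character (UTF-16 code units)...
        int utf16Bytes = feature.length() * 2;
        // ...while ASCII text needs only 1 byte per character in UTF-8.
        int utf8Bytes = feature.getBytes(StandardCharsets.UTF_8).length;
        System.out.println(utf16Bytes + " bytes as char[] vs " + utf8Bytes + " as UTF-8");
    }
}
```

This prints "18 bytes as char[] vs 9 as UTF-8", i.e., a 2x reduction for purely ASCII feature values.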


As always, questions and comments are welcome.
 - Nick



