
illinois-ml-nlp-users - Re: [[Illinois-ml-nlp-users] ] Wikifier Missing Files

illinois-ml-nlp-users AT lists.cs.illinois.edu

Subject: Support for users of CCG software closed 7-27-20

  • From: Lev Ratinov <why2ask AT gmail.com>
  • To: Arvind Rajasekaran <arvind.rajasekaran AT hgdata.com>, illinois-ml-nlp-users <illinois-ml-nlp-users AT cs.uiuc.edu>, Mark Sammons <mssammon AT illinois.edu>
  • Cc: Yuxiao Zhang <yuxiao.zhang AT hgdata.com>
  • Subject: Re: [[Illinois-ml-nlp-users] ] Wikifier Missing Files
  • Date: Mon, 1 Aug 2016 16:03:10 -0400

Thanks, Arvind. I'm a bit surprised, because the package is widely used and these files are critical for its operation. I'm CC-ing Mark Sammons and the UIUC NLP users mailing list here. I hope someone can help you get these files.

I've been out of academia and NLP for quite a few years now and don't maintain my code anymore.



Peace&Love

On Mon, Aug 1, 2016 at 2:52 PM, Arvind Rajasekaran <arvind.rajasekaran AT hgdata.com> wrote:

Dear Dr. Lev Ratinov,

I am a UCSB graduate student. We are trying to use the Wikifier on our dataset and to update some of its source files. The following files are missing, and we only partially understand how they are created. It would be great if you could give us hints on how to create them. If you could share the files themselves with us, that would be immensely helpful as well.

 

The files are listed in order of importance.

 

"./WikiData/Index/CompleteWikipediaIndexVer2.2/";

"./WikiData/Index/SurfaceToTitleIdMap.txt";

"./WikiData/Index/LinkabilityScoresWithGoogleProb.txt";

"./WikiData/Index/TitleIdToSurfaceMap.txt";

categories.tokens.hist.txt

titletoken.hist.txt

WikiArticleWithTopicabilityAndTypes

SurfaceToTitleIdMap.txt

LinkabilityScoresWithGoogleProb.txt

 

 

Regards

Arvind

 

 

 

P.S. We analyzed which fields of these files are actually used. Our analysis is below. We would greatly appreciate your help.

 

 

pathToCategoryKeywordCounts = 

§  categories.tokens.hist.txt

§             Field1: count (integer)

§             Field2: Word (String)

§  used in :

§  addCategoryTokensNormalizationData

§   


pathToAllTokensKeywordsCount = 

§   titletoken.hist.txt

            o   field1: count

            o   field2: tokens

§  used in addTokenInfoAndArticleCountInfo
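
A minimal sketch of how the two histogram files above could be written and read back, based only on the "count, then token" layout described here. The tab separator, the most-frequent-first ordering, and the function names `write_token_histogram` and `read_token_histogram` are assumptions made for illustration. They are not taken from the Wikifier source.

```python
from collections import Counter
import io


def write_token_histogram(tokens, out):
    """Write one 'count<TAB>token' line per distinct token (assumed layout)."""
    counts = Counter(tokens)
    # Most frequent first; ties broken alphabetically so output is stable.
    for token, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        out.write(f"{count}\t{token}\n")


def read_token_histogram(lines):
    """Parse 'count<TAB>token' lines into a {token: count} dictionary."""
    hist = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        count, token = line.split("\t", 1)
        hist[token] = int(count)
    return hist


if __name__ == "__main__":
    buf = io.StringIO()
    write_token_histogram(["city", "river", "city"], buf)
    print(read_token_histogram(buf.getvalue().splitlines()))
```

If the real files use a different separator, only the `split` and `write` calls would need to change.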



pathToCompleteIndexOldVersion = 

§  CompleteWikipediaIndexVer2.2

§  used in AggregateData and BuildProtobufferIndices

§  should contain the following fields: (found in BuildProtobufferIndices)

           o    field:  titleID
           o    field:  titleAppearanceCount
           o    field:  categoriesIDs
           o    field:  categoriesTitles
           o    field:  Text
           o    field:  leftContext
           o    field:  rightContext
           o    field:  linkedFromIDs
           o    field:  linkedFromTitles
           o    field:  linkedToIDs
           o    field:  linkedToTitles
           o    field:  titleForm  (found in BuildWikiTrainingDataFile)


pathTitleToSurfaceFormTextFile = 

§  TitleIdToSurfaceMap.txt (we are not sure about the structure of the fields)

           o    field:  “TitleId” (String)

           o    field:  “Surface” (String) [surface1 number1 surface2 number2 .. ]

           o    field:  ConditionalSurfaceFormProb (Double) (avg(number1, number2, ..))

§  used in IndexSurfaceFormsData

 

pathToSurfaceFormInfoTextFile =

§  SurfaceToTitleIdMap.txt (we are not sure about the structure of the fields)

            o    field: surface form (String) [surface1 number1 surface2 number2 .. ]
            o    field: TitleId (integer)
            o    field: ConditionalTitleAppearance (Double) (avg(number1, number2, ..))

§    used in IndexSurfaceFormsData
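
Since the structure of both surface-form maps is uncertain, the following is only one plausible way such maps could be derived, not a confirmed description of how the original files were built. It assumes the input is a list of (surface form, title ID) pairs harvested from Wikipedia anchor links, and it estimates the conditional probabilities by relative frequency. The function name `conditional_maps` and the input format are hypothetical.

```python
from collections import Counter, defaultdict


def conditional_maps(anchor_pairs):
    """Estimate P(title | surface) and P(surface | title) by relative frequency.

    anchor_pairs: iterable of (surface_form, title_id) tuples (assumed input).
    Returns two nested dicts: surface -> {title_id: prob} and
    title_id -> {surface: prob}.
    """
    pair_counts = Counter(anchor_pairs)
    surface_totals = Counter()
    title_totals = Counter()
    for (surface, title_id), n in pair_counts.items():
        surface_totals[surface] += n
        title_totals[title_id] += n

    surface_to_title = defaultdict(dict)
    title_to_surface = defaultdict(dict)
    for (surface, title_id), n in pair_counts.items():
        surface_to_title[surface][title_id] = n / surface_totals[surface]
        title_to_surface[title_id][surface] = n / title_totals[title_id]
    return dict(surface_to_title), dict(title_to_surface)


if __name__ == "__main__":
    pairs = [("Chicago", 1), ("Chicago", 1), ("Chicago", 2), ("Windy City", 1)]
    s2t, t2s = conditional_maps(pairs)
    print(s2t)
    print(t2s)
```

Whether the real files store these probabilities, raw counts, or the averaged values guessed at above would have to be checked against IndexSurfaceFormsData.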



pathToLinkabilityFile = 

§  LinkabilityScoresWithGoogleProb.txt

           o    Field1: surface form

           o    Field2: var27.nextToken (unused?)

           o    Field3: LinkedAppearanceCount (Integer)

           o    Field4: TotalAppearanceCount (Integer)

           o    Field5: LogProbOnWebGoogle (Double)

§  used in IndexSurfaceFormsData
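
A minimal sketch of a parser for the five-field layout listed above. The tab separator is an assumption, as is the idea that a linkability score would be the ratio of linked to total appearances; neither is confirmed here. The names `LinkabilityRecord` and `parse_linkability_line` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LinkabilityRecord:
    surface_form: str
    unused: str              # Field2, apparently not used by the reader
    linked_count: int        # LinkedAppearanceCount
    total_count: int         # TotalAppearanceCount
    log_prob_web: float      # LogProbOnWebGoogle

    @property
    def linkability(self) -> float:
        # Assumed definition: share of appearances that are links.
        if self.total_count == 0:
            return 0.0
        return self.linked_count / self.total_count


def parse_linkability_line(line: str) -> LinkabilityRecord:
    """Parse one tab-separated line (separator is an assumption)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    surface, unused, linked, total, log_prob = fields
    return LinkabilityRecord(surface, unused, int(linked), int(total), float(log_prob))


if __name__ == "__main__":
    rec = parse_linkability_line("Chicago\tx\t30\t120\t-4.5\n")
    print(rec, rec.linkability)
```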

 

 




