
illinois-ml-nlp-users - RE: [[Illinois-ml-nlp-users] ] Wikifier Missing Files

illinois-ml-nlp-users AT lists.cs.illinois.edu

Subject: Support for users of CCG software closed 7-27-20

  • From: "Sammons, Mark" <mssammon AT illinois.edu>
  • To: Lev Ratinov <why2ask AT gmail.com>, Arvind Rajasekaran <arvind.rajasekaran AT hgdata.com>, illinois-ml-nlp-users <illinois-ml-nlp-users AT cs.uiuc.edu>
  • Cc: Yuxiao Zhang <yuxiao.zhang AT hgdata.com>, "Tsai, Chen-Tse" <ctsai12 AT illinois.edu>, "Upadhyay, Shyam" <upadhya3 AT illinois.edu>
  • Subject: RE: [[Illinois-ml-nlp-users] ] Wikifier Missing Files
  • Date: Wed, 3 Aug 2016 16:46:03 +0000
  • Accept-language: en-US

Hi, Arvind.

Please let me know:

-- where you downloaded this software
-- what version it lists in its pom.xml file
-- what errors you saw, and what command you ran that generated these errors.
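
For the second item, the version can be read directly from pom.xml. A minimal Python sketch, assuming a standard Maven POM in which <version> is a direct child of <project> or is inherited from <parent>; the helper name pom_version is illustrative and not part of the Wikifier distribution:

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://...}version'."""
    return tag.rsplit("}", 1)[-1]


def pom_version(pom_text):
    """Return the project version from POM text, or None if absent.

    Assumption: the version is either a direct child of <project>
    or is inherited from the <parent> element.
    """
    root = ET.fromstring(pom_text)
    children = {local_name(child.tag): child for child in root}
    version = children.get("version")
    if version is not None and version.text:
        return version.text.strip()
    parent = children.get("parent")
    if parent is not None:
        for child in parent:
            if local_name(child.tag) == "version" and child.text:
                return child.text.strip()
    return None


# Usage (run from the directory that contains pom.xml):
# with open("pom.xml", encoding="utf-8") as handle:
#     print(pom_version(handle.read()))
```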

Thanks,

Mark



From: arie.ratinov AT gmail.com [arie.ratinov AT gmail.com] on behalf of Lev Ratinov [why2ask AT gmail.com]
Sent: Monday, August 01, 2016 3:03 PM
To: Arvind Rajasekaran; illinois-ml-nlp-users; Sammons, Mark
Cc: Yuxiao Zhang
Subject: Re: Wikifier Missing Files

Thanks, Arvind. I'm a bit surprised because the package is widely used and these files are critical for its operation. I'm CC-ing Mark Sammons and the UIUC NLP users mailing list here. I hope someone can help you get these files.

I've been out of academia and NLP for quite a few years now and don't maintain my code anymore.



Peace&Love

On Mon, Aug 1, 2016 at 2:52 PM, Arvind Rajasekaran <arvind.rajasekaran AT hgdata.com> wrote:

Dear Dr. Lev Ratinov,

I am a UCSB graduate student. We are trying to use Wikifier on our dataset and to update some of the source files. The following files are missing, and we only partially understand how they are created. It would be great if you could give us hints on how to create these files. If you could share the files with us, that would be immensely helpful as well.

 

The files are listed in order of importance.

 

"./WikiData/Index/CompleteWikipediaIndexVer2.2/";

"./WikiData/Index/SurfaceToTitleIdMap.txt";

"./WikiData/Index/LinkabilityScoresWithGoogleProb.txt";

"./WikiData/Index/TitleIdToSurfaceMap.txt";

categories.tokens.hist.txt

titletoken.hist.txt

WikiArticleWithTopicabilityAndTypes

SurfaceToTitleIdMap.txt

LinkabilityScoresWithGoogleProb.txt

 

 

Regards

Arvind

 

 

 

P. S. We analyzed what fields from the files are being used. Our analysis is below. We would greatly appreciate your help.

 

 

pathToCategoryKeywordCounts = 

§  categories.tokens.hist.txt

§             Field1: count (integer)

§             Field2: Word (String)

§  used in :

§  addCategoryTokensNormalizationData

§   


pathToAllTokensKeywordsCount = 

§   titletoken.hist.txt

            o   field1: count

            o   field2: tokens

§  used in addTokenInfoAndArticleCountInfo
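
The two histogram files described above (categories.tokens.hist.txt and titletoken.hist.txt) appear to share a "count, then token" layout. A minimal Python sketch of reading such a file, assuming one entry per line with the count and the token separated by whitespace; the delimiter and the function names are assumptions, not taken from the Wikifier source:

```python
def parse_histogram_line(line):
    """Split one line into (count, token).

    Assumption: the line looks like 'count<whitespace>token'.
    """
    count_text, token = line.strip().split(None, 1)
    return int(count_text), token


def load_histogram(lines):
    """Build a token -> count mapping from an iterable of lines."""
    counts = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        count, token = parse_histogram_line(line)
        counts[token] = counts.get(token, 0) + count
    return counts


# Usage with a real file:
# with open("titletoken.hist.txt", encoding="utf-8") as handle:
#     histogram = load_histogram(handle)
```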



pathToCompleteIndexOldVersion = 

§  CompleteWikipediaIndexVer2.2

§  used in AggregateData and BuildProtobufferIndices

§  should contain the following fields (found in BuildProtobufferIndices):

           o    field:  titleID
           o    field:  titleAppearanceCount
           o    field:  categoriesIDs
           o    field:  categoriesTitles
           o    field:  Text
           o    field:  leftContext
           o    field:  rightContext
           o    field:  linkedFromIDs
           o    field:  linkedFromTitles
           o    field:  linkedToIDs
           o    field:  linkedToTitles
           o    field:  titleForm  (found in BuildWikiTrainingDataFile)


pathTitleToSurfaceFormTextFile = 

§  TitleIdToSurfaceMap.txt ?? (not sure about structure of fields)

           o    field:  “TitleId” (String)

           o    field:  “Surface” (String) [surface1 number1 surface2 number2 .. ]

           o    field:  ConditionalSurfaceFormProb (Double) (avg(number1, number2, ..))

§  used in IndexSurfaceFormsData

 

pathToSurfaceFormInfoTextFile =

§  SurfaceToTitleIdMap.txt ?? (not sure about structure of fields)

            o    field: surface form (String) [surface1 number1 surface2 number2 .. ]
            o    field: TitleId (integer)
            o    field: ConditionalTitleAppearance (Double) (avg(number1, number2, ..))

§    used in IndexSurfaceFormsData
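
The structure guessed above for the two surface-form map files (a key followed by alternating "name number" entries, with a probability taken as the average of the numbers) can be sketched as follows. This is only an illustration of that guess: the tab delimiter, the position of the key, and the function name are all assumptions, and the real file layout is unconfirmed.

```python
def parse_pairs_line(line, sep="\t"):
    """Parse 'key, name1, number1, name2, number2, ...' from one line.

    Returns (key, [(name, number), ...], average of the numbers).
    Assumption: fields are separated by `sep` and the key comes first.
    """
    fields = line.rstrip("\n").split(sep)
    key, rest = fields[0], fields[1:]
    if len(rest) % 2 != 0:
        raise ValueError("expected alternating name/number fields")
    pairs = [(rest[i], float(rest[i + 1])) for i in range(0, len(rest), 2)]
    average = sum(number for _, number in pairs) / len(pairs) if pairs else 0.0
    return key, pairs, average
```

A tab is used as the separator because surface forms can contain spaces; if the real files are space-delimited, the pairing logic would need to change.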



pathToLinkabilityFile = 

§  LinkabilityScoresWithGoogleProb.txt

           o    Field1: surface form

           o    Field2: var27.nextToken (unused?)

           o    Field3: LinkedAppearanceCount (Integer)

           o    Field4: TotalAppearanceCount (Integer)

           o    Field5: LogProbOnWebGoogle (Double)

§  used in IndexSurfaceFormsData
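
The five linkability fields listed above map naturally onto a small record type. A minimal Python sketch, assuming one tab-separated record per line with the fields in the order given; the delimiter, the class name, and the function name are assumptions rather than the Wikifier's own definitions:

```python
from typing import NamedTuple


class LinkabilityRecord(NamedTuple):
    surface_form: str
    unused: str  # Field2, read but apparently not used
    linked_appearance_count: int
    total_appearance_count: int
    log_prob_on_web_google: float


def parse_linkability_line(line, sep="\t"):
    """Parse one five-field line into a LinkabilityRecord."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    return LinkabilityRecord(
        surface_form=fields[0],
        unused=fields[1],
        linked_appearance_count=int(fields[2]),
        total_appearance_count=int(fields[3]),
        log_prob_on_web_google=float(fields[4]),
    )
```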

 

 




