Lingo phrase matrix

Lingo phrase matrix

Pen Skol
Hi there,

I am not able to understand how the phrase matrix P in Lingo is calculated. In the documents describing the matrix construction, do we have to use the DF of a term from the whole data set or only from the pseudo-documents?

It is not clear how the second column in the following P matrix is calculated:

[Image of the P matrix not preserved]

I understand that after obtaining the term and phrase frequencies for the P matrix, tf-idf weighting and column length normalization are applied, but even that does not give me the same figures as in this matrix. I would appreciate your help.

Best,

Skold

_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers

Fwd: Lingo phrase matrix

Stanislaw Osinski
Administrator
Hi Skold,

The description in the thesis could indeed have been better. Here are the missing bits:

* for the P matrix calculation, the TF is the frequency of terms in the specific phrase, which means it will be 1 most of the time (unless some word appears more than once in the phrase)
* the IDF factor is taken based on the input documents only (so not including the phrase in question)
* the logarithm in IDF is base 2
* each column is normalized to unit Euclidean length (so that computing cosine similarity is then a simple matrix multiplication)
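
To make the steps above concrete, here is a minimal sketch in Python. It is not the Carrot2 code: the function names and the toy documents are made up for illustration, and the exact IDF formula, log2(N / df), is an assumption, since the list above only fixes the logarithm base and the fact that DF is counted over the input documents.

```python
import math
from collections import Counter


def idf_base2(term, documents):
    """IDF over the input documents only; log2(N / df) is an assumed formula."""
    n_docs = len(documents)
    df = sum(1 for doc in documents if term in doc)
    if df == 0:
        return 0.0  # term absent from every document: contributes nothing
    return math.log2(n_docs / df)


def phrase_column(phrase, vocabulary, documents):
    """Build one column of P for a tokenized phrase, scaled to unit length."""
    tf = Counter(phrase)  # term frequency inside the phrase itself, usually 1
    weights = [tf[term] * idf_base2(term, documents) for term in vocabulary]
    length = math.sqrt(sum(w * w for w in weights))
    if length == 0:
        return weights
    return [w / length for w in weights]


def cosine(u, v):
    """For unit-length vectors the cosine similarity is just the dot product."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy data, not the example from the thesis.
documents = [
    {"information", "retrieval", "system"},
    {"information", "theory"},
    {"retrieval", "system"},
    {"matrix", "theory"},
]
vocabulary = ["information", "retrieval", "system", "theory", "matrix"]

column = phrase_column(["information", "retrieval"], vocabulary, documents)
print([round(w, 3) for w in column])  # [0.707, 0.707, 0.0, 0.0, 0.0]
```

Because every column has unit length, multiplying the transposed phrase matrix by another matrix of unit-length columns gives all the cosine similarities at once; the `cosine` helper shows the single-pair case.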

If you apply all the above, you should get the results shown in the thesis. I recreated the complete calculation for the second column ("Information Retrieval" phrase) in this spreadsheet: https://docs.google.com/spreadsheets/d/1vcmzXlhi-71ivptNzlGx3vlrCHOfsGJgWuU1VuMsW_g/edit#gid=0.

Finally, the implementation of Lingo you'll find in current Carrot2 differs a bit from the one described in our papers. The changes are briefly mentioned in this paper: http://www.ijcte.org/papers/842-IT022.pdf.

Thanks,

Stanislaw



On Fri, May 9, 2014 at 11:41 AM, Pen Skol <[hidden email]> wrote:
[...]

Re: Fwd: Lingo phrase matrix

Pen Skol
Hi Stanislaw,

Thank you for your answer. I will go through these.

Cheers,



On Mon, May 12, 2014 at 2:04 AM, Stanislaw Osinski <[hidden email]> wrote:
[...]