Term-Phrase Matrix and Label Building Phase

Term-Phrase Matrix and Label Building Phase

seyfullahd
Hello,

I have questions about how the term-phrase matrix is built and, based on that, how cluster labels are selected.

First question: in the TermDocumentMatrixBuilder.buildTermPhraseMatrix method, as far as I can understand, only phrases (labels with more than one word) are included in the term-phrase matrix. However, in the buildAlignedMatrix method called from buildTermPhraseMatrix, the code is written as if one-word label candidates are also handled.

If one-word label candidates are not included in the term-phrase matrix but are handled as a special case, how are they used in ClusterBuilder's buildLabels method? I am really confused and would be glad if you could help me understand. I hope I asked the question the right way :)

My second question is about weighting the term-phrase matrix. It is weighted exactly as the term-document matrix is. I wonder about the main idea behind this. I mean, it could also have been weighted considering only the term-phrase space, but it wasn't. I wonder why.

And a more general final question: for the term-document matrix, the term-phrase matrix, and the term-abstract concept matrix, the term base is the same (same number of rows and same stems), right?

Thanks in advance :)

Seyfullah

Re: Term-Phrase Matrix and Label Building Phase

Stanislaw Osinski-3
Hi Seyfullah,

First question: in the TermDocumentMatrixBuilder.buildTermPhraseMatrix method, as far as I can understand, only phrases (labels with more than one word) are included in the term-phrase matrix. However, in the buildAlignedMatrix method called from buildTermPhraseMatrix, the code is written as if one-word label candidates are also handled.

Most probably we previously had some code that used the single-term case; that code is now gone, but the condition is still there.

 
If one-word label candidates are not included in the term-phrase matrix but are handled as a special case, how are they used in ClusterBuilder's buildLabels method? I am really confused and would be glad if you could help me understand. I hope I asked the question the right way :)

In general, to keep memory usage low, we don't explicitly build the term-label matrix as described in the paper. Since the matrix would be very sparse (phrases are usually short), we go through the columns one by one to compute the similarities rather than creating a sparse matrix and using multiplication.

Take a look at this bit: https://github.com/carrot2/carrot2/blob/master/core/carrot2-algorithm-lingo/src/org/carrot2/clustering/lingo/ClusterBuilder.java#L192. It first does the above for the single-term labels and then for the phrase labels.
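The column-by-column idea can be sketched roughly like this (a toy illustration in Python with made-up numbers, not the actual Carrot2 code): each label is kept as a sparse list of (term row, weight) pairs, and its cosine against each abstract-concept column is computed directly, without ever materializing a term-label matrix.

```python
from math import sqrt

def column_cosines(term_concept, labels):
    """For each label, given as a sparse list of (term_row, weight) pairs,
    compute the cosine against every abstract-concept column without
    materializing a full term-label matrix."""
    n_concepts = len(term_concept[0])
    scores = []
    for label in labels:
        # Length-normalize the label's sparse column vector.
        norm = sqrt(sum(w * w for _, w in label))
        row = []
        for c in range(n_concepts):
            # The dot product only touches the label's non-zero rows.
            dot = sum(w * term_concept[t][c] for t, w in label)
            row.append(dot / norm if norm > 0 else 0.0)
        scores.append(row)
    return scores

# Hypothetical term x abstract-concept matrix (columns roughly unit length).
U = [[0.8, 0.1],
     [0.6, 0.2],
     [0.0, 0.97]]

# One single-term label (term row 2) and one two-term phrase label (rows 0 and 1).
labels = [[(2, 1.0)], [(0, 1.0), (1, 1.0)]]
print(column_cosines(U, labels))
```

For a single-term label the sparse list has exactly one entry, which is why that case reduces to a simple row lookup.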

 
My second question is about weighting the term-phrase matrix. It is weighted exactly as the term-document matrix is. I wonder about the main idea behind this. I mean, it could also have been weighted considering only the term-phrase space, but it wasn't. I wonder why.

This seemed the most natural choice; I don't remember whether I tried different weightings for the term-document and term-phrase matrices. It could actually make sense, since the cosine similarity is computed against the reduced term-document matrix, so we don't need to be "compatible" with the original term-document matrix weighting-wise. As long as you keep the columns of the term-phrase matrix normalized, any weighting should do.
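One narrow illustration of that last point (a Python sketch with hypothetical numbers): two weightings that differ only by a constant scale factor per column produce identical columns after length normalization, so that part of the weighting choice washes out entirely.

```python
from math import sqrt

def normalize_columns(m):
    """Scale each column of m to unit Euclidean length."""
    rows, cols = len(m), len(m[0])
    norms = [sqrt(sum(m[r][c] ** 2 for r in range(rows))) for c in range(cols)]
    return [[m[r][c] / norms[c] if norms[c] else 0.0 for c in range(cols)]
            for r in range(rows)]

# The same hypothetical term-phrase matrix under two weightings that
# differ only by a constant factor...
tf = [[3.0, 0.0],
      [1.0, 2.0]]
tf_doubled = [[6.0, 0.0],
              [2.0, 4.0]]

# ...collapse to the same unit-length columns after normalization.
print(normalize_columns(tf))
print(normalize_columns(tf_doubled))
```

Only the relative weights within a column survive normalization, which is why the choice between compatible weighting schemes matters less than one might expect.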

 
And a more general final question: for the term-document matrix, the term-phrase matrix, and the term-abstract concept matrix, the term base is the same (same number of rows and same stems), right?

Correct, this is what the buildAlignedMatrix() method is supposed to do.

Cheers,

Staszek


------------------------------------------------------------------------------
This SF.net email is sponsored by Windows:

Build for Windows Store.

http://p.sf.net/sfu/windows-dev2dev
_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers

Re: Term-Phrase Matrix and Label Building Phase

seyfullahd
Stanislaw Osinski-3 wrote
In general, to keep memory usage low, we don't explicitly build the term-label matrix as described in the paper. Since the matrix would be very sparse (phrases are usually short), we go through the columns one by one to compute the similarities rather than creating a sparse matrix and using multiplication.

Take a look at this bit: https://github.com/carrot2/carrot2/blob/master/core/carrot2-algorithm-lingo/src/org/carrot2/clustering/lingo/ClusterBuilder.java#L192. It first does the above for the single-term labels and then for the phrase labels.
I get the idea, thanks. I will restudy it that way.

Stanislaw Osinski-3 wrote
This seemed the most natural choice; I don't remember whether I tried different weightings for the term-document and term-phrase matrices. It could actually make sense, since the cosine similarity is computed against the reduced term-document matrix, so we don't need to be "compatible" with the original term-document matrix weighting-wise. As long as you keep the columns of the term-phrase matrix normalized, any weighting should do.
After asking the question, I started thinking more deeply about this :) and need to ask it again. As soon as I finish my weekly assignments, I will come back with a new, detailed question :) Thanks :)

Seyfullah

Re: Term-Phrase Matrix and Label Building Phase

seyfullahd
This post was updated on .
In reply to this post by Stanislaw Osinski-3
Hello again :)

When I was talking about the same weighting, I didn't mean the use of the same weighting method, like tf, linear tf-idf, or log tf-idf. Since the termWeighting attribute is specified, the specified weighting scheme is used in both the term-document matrix and the term-phrase matrix. I thought maybe I had been misunderstood, which is why I wanted to mention it.

Seyfullah wrote
My second question is about weighting the term-phrase matrix. It is weighted exactly as the term-document matrix is. I wonder about the main idea behind this. I mean, it could also have been weighted considering only the term-phrase space, but it wasn't. I wonder why.
What I really wanted to ask above is: when the term-phrase matrix is being created, why aren't a stem's number of occurrences in the phrase and the number of phrases containing the stem used? And by what approach or heuristic are phrases handled using their constituent stems' total frequency and the total number of documents containing those stems? See below if that is unclear.

[
  Current:
  In term-document matrix: weighting(term, document, Documents)
  In term-phrase matrix: weighting(term, Documents)

  My curiosity case :)
  In term-phrase matrix: weighting(term, phrase, Phrases)
]

Please forgive my complicated way of explaining, if my style seems convoluted to you :)
However, maybe you understood me correctly and the answer will still be the same :) I just wanted to make it clear and eliminate the ambiguity for me.

Additionally, I studied ClusterBuilder's buildLabels and all the concrete assignLabels implementations of ILabelAssigner. In the buildLabels method, currently here: https://github.com/carrot2/carrot2/blob/master/core/carrot2-algorithm-lingo/src/org/carrot2/clustering/lingo/ClusterBuilder.java#L197, the same labelIndex, equal to zero, is returned for all rows of the stemCos matrix. This is because when featureScorer is null, the wordLabelIndex array is filled with zeros for all words (line #177), and featureScorer is currently null. As a result, all rows of the stemCos matrix are penalized with the same penalty value, the one for the first word (first label) in AllWords. And here is my question: if the stemCos column vectors were only compared with each other in the assignLabels methods, this would not be a problem, since all the cells change by the same degree. But since the stemCos column vectors and cell scores (stem-abstract concept) are also compared with the phrase column vectors and cell scores (phrase-abstract concept), I think this mis-penalizing of the stemCos row vectors may lead to some abnormal results, may it not?

Similarly, while phraseCos is obtained by multiplying the phrase-term matrix by the term-abstract concept matrix, the stemCos matrix is obtained by filtering rows of the term-abstract concept matrix. When matching abstract concepts with phrases or stems by comparing their phraseScores and stemScores as above, may the different weightings cause abnormal results? (The stemCos matrix's weighting is the same as the term-abstract concept matrix's and is derived from the term-document matrix, while the phraseCos matrix is in effect weighted by the multiplication of the phrase-term matrix and the term-abstract concept matrix.)
 
Thanks in advance :)

Seyfullah

Re: Term-Phrase Matrix and Label Building Phase

Stanislaw Osinski-3

What I really wanted to ask above is: when the term-phrase matrix is being created, why aren't a stem's number of occurrences in the phrase and the number of phrases containing the stem used? And by what approach or heuristic are phrases handled using their constituent stems' total frequency and the total number of documents containing those stems? See below if that is unclear.

[
  Current:
  In term-document matrix: weighting(term, document, Documents)
  In term-phrase matrix: weighting(term)

  My curiosity case :)
  In term-phrase matrix: weighting(term, phrase, Phrases)
]

Please forgive my complicated way of explaining, if my style seems convoluted to you :)
However, maybe you understood me correctly and the answer will still be the same :) I just wanted to make it clear and eliminate the ambiguity for me.

Ok, so you suggest that if the document collection contains only one occurrence of, say, "data mining", plus 9 occurrences of "data" and 4 occurrences of "mining" separately (not as a phrase), then the term-phrase matrix entry should be (1, 1) for the phrase "data mining", rather than (10, 5) as it is now? I'm not sure whether I considered this approach. On the one hand it seems more accurate, but on the other, you'd be "flattening out" the variety of values in the matrix, because most of the time the occurrence counts of the terms in each phrase will be equal. The number of occurrences of the phrase as such does not play any role, since the term-phrase matrix is then column-length-normalized (as far as I remember).
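The "flattening" effect can be seen numerically (a Python sketch using the hypothetical counts from the example above):

```python
from math import sqrt

def unit(v):
    """Length-normalize a vector."""
    n = sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Hypothetical counts for the phrase "data mining": "data" occurs 10 times
# in the collection (9 alone + 1 in the phrase), "mining" 5 times.
collection_counts = [10.0, 5.0]  # current approach: collection-wide counts
in_phrase_counts = [1.0, 1.0]    # alternative: counts within the phrase only

print(unit(collection_counts))  # keeps the relative difference between terms
print(unit(in_phrase_counts))   # "flattened": both terms weighted equally
```

With collection-wide counts the phrase's normalized column still encodes that "data" is a more frequent term than "mining"; with in-phrase counts every term in the phrase gets the same weight.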

 
Additionally, I studied ClusterBuilder's buildLabels and all the concrete assignLabels implementations of ILabelAssigner. In the buildLabels method, currently here: https://github.com/carrot2/carrot2/blob/master/core/carrot2-algorithm-lingo/src/org/carrot2/clustering/lingo/ClusterBuilder.java#L197, the same labelIndex, equal to zero, is returned for all rows of the stemCos matrix. This is because when featureScorer is null, the wordLabelIndex array is filled with zeros for all words (line #177), and featureScorer is currently null. As a result, all rows of the stemCos matrix are penalized with the same penalty value, the one for the first word (first label) in AllWords. And here is my question: if the stemCos column vectors were only compared with each other in the assignLabels methods, this would not be a problem, since all the cells change by the same degree. But since the stemCos column vectors and cell scores (stem-abstract concept) are also compared with the phrase column vectors and cell scores (phrase-abstract concept), I think this mis-penalizing of the stemCos row vectors may lead to some abnormal results, may it not?

This is actually a bug, good catch! wordLabelIndex should always be initialized, regardless of whether the featureScorer is provided or not. I've committed a fix here:

https://github.com/carrot2/carrot2/commit/7631559f62c17f191d7c70d95c3e7bef86e45a82

Thanks,

Staszek



Re: Term-Phrase Matrix and Label Building Phase

seyfullahd
Stanislaw Osinski-3 wrote
Ok, so you suggest that if the document collection contains only one occurrence of, say, "data mining", plus 9 occurrences of "data" and 4 occurrences of "mining" separately (not as a phrase), then the term-phrase matrix entry should be (1, 1) for the phrase "data mining", rather than (10, 5) as it is now? I'm not sure whether I considered this approach. On the one hand it seems more accurate, but on the other, you'd be "flattening out" the variety of values in the matrix, because most of the time the occurrence counts of the terms in each phrase will be equal. The number of occurrences of the phrase as such does not play any role, since the term-phrase matrix is then column-length-normalized (as far as I remember).
So that's why you use weighting(term, Documents) for the term-phrase matrix. I think I understand, almost :) By the way, in your example, should the term-phrase matrix entry be (10, 5) for the phrase "data mining", or (9, 4)?

Stanislaw Osinski-3 wrote
This is actually a bug, good catch! wordLabelIndex should always be initialized, regardless of whether the featureScorer is provided or not. I've committed a fix here:

https://github.com/carrot2/carrot2/commit/7631559f62c17f191d7c70d95c3e7bef86e45a82
Thank you; you (both of you :)) teach me, and I try to contribute :)
Also, thank you for mentioning me in your commit comment :)

Also, I think you saw my first post and answered it. I actually edited the post and added one more question at the end:

"
Similarly, while phraseCos is obtained by multiplying the phrase-term matrix by the term-abstract concept matrix, the stemCos matrix is obtained by filtering rows of the term-abstract concept matrix. When matching abstract concepts with phrases or stems by comparing their phraseScores and stemScores as above, may the different weightings cause abnormal results? (The stemCos matrix's weighting is the same as the term-abstract concept matrix's and is derived from the term-document matrix, while the phraseCos matrix is in effect weighted by the multiplication of the phrase-term matrix and the term-abstract concept matrix.)
"

I thought maybe you hadn't noticed it, so I wanted to ask again.

Thanks,

Seyfullah


Re: Term-Phrase Matrix and Label Building Phase

Stanislaw Osinski-3

So that's why you use weighting(term, Documents) for the term-phrase matrix. I think I understand, almost :) By the way, in your example, should the term-phrase matrix entry be (10, 5) for the phrase "data mining", or (9, 4)?

Assuming that the "data mining" phrase is in addition to the other "data" and "mining" occurrences, there are 10 and 5 occurrences in total, respectively, so it should be (10, 5). In other words, in the case of individual stems, we count all occurrences, including those in phrases.


Also, I think you saw my first post and answered it. I actually edited the post and added one more question at the end:

I'm using the e-mail interface, not the web one, and the updates don't seem to be e-mailed to the list.

 
Similarly, while phraseCos is obtained by multiplying the phrase-term matrix by the term-abstract concept matrix, the stemCos matrix is obtained by filtering rows of the term-abstract concept matrix. When matching abstract concepts with phrases or stems by comparing their phraseScores and stemScores as above, may the different weightings cause abnormal results? (The stemCos matrix's weighting is the same as the term-abstract concept matrix's and is derived from the term-document matrix, while the phraseCos matrix is in effect weighted by the multiplication of the phrase-term matrix and the term-abstract concept matrix.)

If we wanted to create a counterpart of the phrase-term matrix for individual stems, the matrix would contain exactly one non-zero value in each column. If we then column-normalized that matrix, we'd end up with exactly one 1.0 value in each column. So, regardless of the weighting, instead of creating the matrix and doing the matrix multiplication, we can directly pick the appropriate value from the term-abstract concept matrix.
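A minimal sketch of that equivalence (plain Python, with a hypothetical matrix U standing in for the term-abstract concept matrix): multiplying by a column-normalized, one-non-zero-per-column matrix is the same as selecting rows.

```python
# Hypothetical term x abstract-concept matrix.
U = [[0.8, 0.1],
     [0.6, 0.2],
     [0.0, 0.97]]

def select_row(u, stem_row):
    """Direct lookup: the stem's value against each abstract concept."""
    return u[stem_row]

def multiply_one_hot(u, stem_row):
    """The equivalent (wasteful) route: build the stem's column of a
    column-normalized single-stem matrix, which is a one-hot unit vector,
    and multiply it against every concept column."""
    one_hot = [1.0 if r == stem_row else 0.0 for r in range(len(u))]
    return [sum(one_hot[r] * u[r][c] for r in range(len(u)))
            for c in range(len(u[0]))]

print(select_row(U, 2))
print(multiply_one_hot(U, 2))
```

Both routes return the same row, so the explicit single-stem matrix and its multiplication can be skipped entirely.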

S.


Re: Term-Phrase Matrix and Label Building Phase

seyfullahd
Stanislaw Osinski-3 wrote
Assuming that the "data mining" phrase is in addition to the other "data" and "mining" occurrences, there are 10 and 5 occurrences in total, respectively, so it should be (10, 5). In other words, in the case of individual stems, we count all occurrences, including those in phrases.
I understand it, thanks :)


Stanislaw Osinski-3 wrote
If we wanted to create a counterpart of the phrase-term matrix for individual stems, the matrix would contain exactly one non-zero value in each column. If we then column-normalized that matrix, we'd end up with exactly one 1.0 value in each column. So, regardless of the weighting, instead of creating the matrix and doing the matrix multiplication, we can directly pick the appropriate value from the term-abstract concept matrix.
I understand. If a matrix were created for individual stems and the matrix multiplication were carried out to obtain the individual stem-abstract concept matrix, the results would be the same as with the current solution, so it is not necessary to create that matrix or do the multiplication. I get it now. Thank you.

But I still don't understand how stemScore and phraseScore are comparable to each other, in relation to my question.

Thanks,

Seyfullah


Re: Term-Phrase Matrix and Label Building Phase

Stanislaw Osinski-3

But I still don't understand how stemScore and phraseScore are comparable to each other, in relation to my question.

Think of it as two documents in the traditional VSM model: one document with only one word, and another with more words. If you length-normalize the document vectors, the one-word document will have just one 1.0 value in its vector, but you still have to compare it to the other, longer documents. There is some doubt related to the weighting scheme, but the only way to avoid it would be to skip normalization, and then the "cosine" values would not be in the 0...1 range, which would make them harder to interpret.
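The analogy in a small Python sketch (hypothetical vectors): with length normalization built into the cosine, a one-word document and a longer document are directly comparable against the same vector, and the scores stay within 0...1.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity; length normalization keeps the result in 0..1
    for vectors with non-negative weights."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical vocabulary: ["data", "mining", "web"].
one_word_doc = [1.0, 0.0, 0.0]  # normalizes to a single 1.0 entry
longer_doc = [2.0, 1.0, 1.0]    # several non-zero weights
other = [1.0, 1.0, 0.0]

# Both documents can be compared against the same vector.
print(cosine(one_word_doc, other))
print(cosine(longer_doc, other))
```

Skipping the normalization would make the raw dot products depend on vector length, pushing the scores out of the 0...1 range and making them harder to interpret, which matches the point above.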

S.


Re: Term-Phrase Matrix and Label Building Phase

seyfullahd
I understand it now :)

Thanks

Seyfullah