Hello,
I have some questions about how the term-phrase matrix is built and, based on that, how cluster labels are selected.

First question: in the TermDocumentMatrixBuilder.buildTermPhraseMatrix method, as far as I can tell, only phrases (labels longer than one word) are included in the term-phrase matrix. However, in the buildAlignedMatrix method called from buildTermPhraseMatrix, the code looks as if one-word label candidates are also handled. If one-word label candidates are not included in the term-phrase matrix but are handled as a special case, how are they used in ClusterBuilder's buildLabels method? I am really confused and would be glad if you could help me understand. I hope I have asked the question the right way :)

Second question, about weighting the term-phrase matrix: it is weighted exactly the way the term-document matrix is weighted. I wonder about the main idea behind this. The phrases could also have been weighted considering only the term-phrase space, but they weren't, and I wonder why.

And a more general final question: for the term-document matrix, the term-phrase matrix and the term-abstract-concept matrix, the term base is the same (same number of rows, same stems), right?

Thanks in advance :)

Seyfullah
Hi Seyfullah,
> First question, in TermDocumentMatrixBuilder.buildTermPhraseMatrix method,

Most probably we previously had some code that used the single-term case; that code is now gone, but the condition is still there.
> If one word length label candidates are not included to the term-phrase

In general, to keep memory usage low, we don't explicitly build the term-label matrix as described in the paper. Since the matrix would be very sparse (phrases are usually short), we go through the columns one by one to compute the similarities, rather than creating a sparse matrix and using multiplication. Take a look at this bit: https://github.com/carrot2/carrot2/blob/master/core/carrot2-algorithm-lingo/src/org/carrot2/clustering/lingo/ClusterBuilder.java#L192. It first does the above for the single-term labels and then for the phrase labels.
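The column-by-column approach described above could be sketched roughly like this (a minimal illustration with hypothetical names, not the actual Carrot2 code; `termRows` and `termWeights` stand in for a label's term indices and weights):

```java
// Sketch: compute a label's cosine against one abstract-concept column directly,
// without materializing a sparse term-label matrix.
public class LabelSimilaritySketch {
    /**
     * Cosine of a label's (length-normalized) term vector against one column
     * of the term-abstract-concept matrix.
     *
     * @param termWeights weights of the label's terms in the term space
     * @param termRows    row index of each of those terms
     * @param concept     one column of the term-abstract-concept matrix
     */
    public static double labelConceptCosine(double[] termWeights, int[] termRows,
                                            double[] concept) {
        double norm = 0;
        for (double w : termWeights) {
            norm += w * w;
        }
        norm = Math.sqrt(norm);

        // Walk only the label's few non-zero entries -- this is why no sparse
        // matrix multiplication is needed.
        double dot = 0;
        for (int i = 0; i < termRows.length; i++) {
            dot += (termWeights[i] / norm) * concept[termRows[i]];
        }
        return dot;
    }
}
```

Note that for a single-term label the loop degenerates to reading one value out of the concept column, which matches the single-term branch in buildLabels.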
> Second question is about weighting term-phrase matrix. It is weighted just

This seemed the most natural choice; I don't remember whether I tried different weightings for the term-document and term-phrase matrices. It could actually make sense, since the cosine similarity is taken against the reduced term-document matrix, so we don't need to be "compatible" with the original term-document matrix weighting-wise. As long as you keep the columns of the term-phrase matrix normalized, any weighting should do.
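The column normalization mentioned above can be sketched as follows (a toy dense-matrix version, not Carrot2's actual matrix code):

```java
// Sketch: normalize each column of a term-phrase matrix to unit length, so that
// cosine similarities against it fall in a comparable range regardless of the
// weighting scheme that produced the raw values.
public class ColumnNormalizer {
    public static void normalizeColumns(double[][] m) {
        int rows = m.length;
        int cols = m[0].length;
        for (int c = 0; c < cols; c++) {
            double norm = 0;
            for (int r = 0; r < rows; r++) {
                norm += m[r][c] * m[r][c];
            }
            norm = Math.sqrt(norm);
            if (norm == 0) {
                continue; // all-zero column, leave as-is
            }
            for (int r = 0; r < rows; r++) {
                m[r][c] /= norm;
            }
        }
    }
}
```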
> And a more general final question, For all term-document matrix, term-phrase

Correct, this is what the buildAlignedMatrix() method is supposed to do.

Cheers,
Staszek

------------------------------------------------------------------------------
This SF.net email is sponsored by Windows: Build for Windows Store.
http://p.sf.net/sfu/windows-dev2dev
_______________________________________________
Carrot2-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/carrot2-developers
I get the idea, thanks. I will restudy it that way. After asking the question, I started thinking more deeply about this :) and need to ask again. As soon as I finish my weekly assignments, I will come back with a new, detailed question :)

Thanks :)

Seyfullah
Hello again :)
When I talked about the same weighting, I didn't mean using the same weighting method, such as tf, linear tf-idf or log tf-idf. Since the termWeighting attribute is specified, the chosen weighting scheme is used in both the term-document matrix and the term-phrase matrix. I thought I might have been misunderstood, which is why I wanted to mention it. What I really wanted to ask above is: when the term-phrase matrix is built, why aren't the stem's number of occurrences in the phrase and the number of phrases containing the stem used? And based on which approach or heuristic are phrases weighted using their stems' total frequency and the total number of documents containing those stems? In short:

Current:
  term-document matrix: weighting(term, document, Documents)
  term-phrase matrix:   weighting(term, Documents)

What I am curious about :)
  term-phrase matrix:   weighting(term, phrase, Phrases)

Please forgive me if my way of putting it seems complicated :) Maybe you understood me correctly all along and the answer is still the same :) I just wanted to make it clear and remove the ambiguity.

Additionally, I studied ClusterBuilder.buildLabels and all the concrete assignLabels implementations of ILabelAssigner. In buildLabels, currently here: https://github.com/carrot2/carrot2/blob/master/core/carrot2-algorithm-lingo/src/org/carrot2/clustering/lingo/ClusterBuilder.java#L197, for all rows of the stemCos matrix the same labelIndex, equal to zero, is returned. This is because when featureScorer is null, the wordLabelIndex array is filled with zeros for all words (line #177), and featureScorer is currently null. As a result, all rows of the stemCos matrix are penalized with the penalty value of the first word (first label) in allWords.
And here is my question about this: if the stemCos column vectors were only compared to each other in the assignLabels methods, this would not be a problem, since all the cells would change by the same degree. But since the stemCos column vectors, and of course the cell scores (stem-abstract concept), are also compared to the phrase column vectors and cell scores (phrase-abstract concept), I think this mis-penalizing of the stemCos row vectors may lead to some abnormal results, may it not?

Similarly, while phraseCos is obtained by multiplying the phrase-term matrix and the term-abstract-concept matrix, the stemCos matrix is obtained by filtering rows of the term-abstract-concept matrix. When matching abstract concepts with phrases or stems by comparing their phraseScores and stemScores as above, may the different weightings cause abnormal results? (The stemCos matrix's weighting is the same as the term-abstract-concept matrix's and is derived from the term-document matrix, while the phraseCos matrix is somehow weighted by the multiplication of the phrase-term matrix and the term-abstract-concept matrix.)

Thanks in advance :)

Seyfullah
Ok, so you suggest that if the document collection contains only one occurrence of, say, "data mining" and 9 occurrences of "data" and 4 occurrences of "mining" separately (not as a phrase), then the term-phrase matrix should be (1, 1) for the phrase "data mining", rather than (10, 5) as it is now? I'm not sure if I considered this approach. On the one hand it seems more accurate, but on the other you'd be "flattening" out the variety of values in the matrix, because most of the time the occurrence counts of terms in each phrase will be equal. The number of occurrences of the phrase as such does not play any role, since the term-phrase matrix is then column-length-normalized (as far as I remember).
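To make the "flattening" effect concrete, here is the "data mining" example above after column length normalization (a toy sketch assuming raw occurrence counts, i.e. plain tf weighting):

```java
// Sketch: compare the current (10, 5) column for "data mining" with the
// proposed in-phrase-counts (1, 1) column, both after length normalization.
public class PhraseWeightingExample {
    public static double[] normalized(double... column) {
        double norm = 0;
        for (double v : column) {
            norm += v * v;
        }
        norm = Math.sqrt(norm);
        double[] out = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            out[i] = column[i] / norm;
        }
        return out;
    }
}
```

With (10, 5) the normalized column keeps the 2:1 ratio between "data" and "mining"; with (1, 1) it collapses to two equal components, which is exactly the flattening described above.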
> And, additionally I studied the ClusterBuilder buildLabels and all concrete

This is actually a bug, good catch! wordLabelIndex should always be initialized, regardless of whether the featureScorer is provided or not. I've committed a fix here:
Thanks,
Staszek
So that's why you use weighting(term, Documents) for the term-phrase matrix. I guess I understand, almost :) By the way, in your example, should the term-phrase matrix be (10, 5) for the phrase "data mining", or (9, 4)? Thank you, both of you :) You teach me and I try to contribute :) Also thank you for mentioning me in your commit comment :)

Also, I think you saw my first post and answered it. I actually edited the post and added one more question at the end:

> Similarly, while phraseCos is acquired by multiplying phrase-term matrix and
> term-abstract concept matrix, stemCos matrix is acquired by filtering rows of
> term-abstract concept matrix. When matching abstract concepts with phrases or
> stems by comparing their phraseScores and stemScores like above; may different
> weightings cause anormal results? (stemCos matrix's weight is as same as
> term-abstract concept matrix and it is derived from term-document matrix while
> phraseCos matrix weight is somehow weighted by multiplying phrase-term matrix
> and term-abstract concept matix)

I thought maybe you hadn't noticed it, so I wanted to ask again.

Thanks,
Seyfullah
Assuming that the "data mining" phrase is in addition to the other "data" and "mining" occurrences, there are 10 and 5 occurrences in total, respectively, so it should be (10, 5). In other words, in the case of individual stems, we count all occurrences, including those in phrases.
> And also, I think you saw my first post and answer over that. I actually

I'm using the e-mail interface, not the web one, and the updates don't seem to be e-mailed to the list.

> Similarly, while phraseCos is acquired by multiplying phrase-term matrix and

If we wanted to create a counterpart of the phrase-term matrix for individual stems, the matrix would contain exactly one non-zero value in each column. If we then column-normalized that matrix, we'd end up with exactly one 1.0 value in each column. So regardless of the weighting, instead of creating the matrix and doing the matrix multiplication, we can directly pick the appropriate value from the term-abstract-concept matrix.
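The equivalence described above, that multiplying by a normalized single-entry column is the same as a direct lookup, can be demonstrated with a small sketch (hypothetical names, not the Carrot2 implementation):

```java
// Sketch: for a single-stem label, the normalized stem column is a unit basis
// vector (one 1.0, zeros elsewhere), so multiplying it with the
// term-abstract-concept matrix just reads out the stem's row of that matrix.
public class SingleStemLookup {
    /** Multiply the unit basis vector e_stemRow with the term-concept matrix. */
    public static double[] viaMultiplication(int stemRow, double[][] termConcept) {
        int concepts = termConcept[0].length;
        double[] result = new double[concepts];
        for (int k = 0; k < concepts; k++) {
            // Only the stemRow entry of the column is non-zero (and equals 1.0
            // after normalization), so the sum collapses to a single term.
            result[k] = 1.0 * termConcept[stemRow][k];
        }
        return result;
    }

    /** The shortcut: pick the stem's row directly. */
    public static double[] viaLookup(int stemRow, double[][] termConcept) {
        return termConcept[stemRow].clone();
    }
}
```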
S.
I understand it, thanks :) If a matrix were created for individual stems and the matrix multiplication were done to get the individual stem-abstract-concept matrix, the results would be the same as with the current solution, so it is not necessary to create that matrix and do the multiplication. I get it now, thank you.

But I still don't understand how stemScore and phraseScore are comparable to each other, in relation to my question.

Thanks,
Seyfullah
> But I don't still understand how stemScore and phraseScore are comparable to

Think of it as of two documents in the traditional VSM model: one document with only one word, the other with some more words. If you length-normalize the document vectors, the one-word document will have just one 1.0 value in its vector, but you still have to compare it to the other, longer documents. There is the doubt related to the weighting scheme, but the only way to avoid it would be to skip normalization; then, however, the "cosine" values would not be in the 0...1 range, which would make them more difficult to interpret.
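The VSM analogy above can be sketched with a toy cosine computation (hypothetical vectors, just for illustration):

```java
// Sketch: a one-word "document", after length normalization, is a unit basis
// vector, yet its cosine against a longer document is still well defined and
// directly comparable to cosines between longer documents.
public class VsmComparison {
    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        // Dividing by both norms is equivalent to normalizing the vectors first.
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```

A one-word document such as (0, 2, 0) normalizes to (0, 1, 0), and its cosine against a three-word document like (1, 1, 1) stays in the 0...1 range, which is the property skipping normalization would lose.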
S.
I can understand now :)
Thanks,
Seyfullah